
Thread: Hutter Prize update

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts

    Hutter Prize update

    The Hutter prize has been updated by 10x in prize money (5000 euros per 1% improvement), size, and CPU time and memory allowed. http://prize.hutter1.net/

    The task is now to compress (instead of decompress) enwik9 to a self extracting archive in 100 hours with 10 MB RAM allowed. Submissions must be OSI licensed. The baseline is phda9 (which meets time and memory but not other requirements). The compressor size is added to the archive size. Contest starts April 14 2020 with awards thereafter for improvements of 1% or better.

  2. Thanks (15):

    bwt (22nd February 2020),byronknoll (22nd February 2020),compgt (23rd February 2020),CompressMaster (22nd February 2020),Darek (22nd February 2020),Hakan Abbas (22nd February 2020),JamesB (26th February 2020),kaitz (21st February 2020),Mike (22nd February 2020),RichSelian (23rd February 2020),schnaader (21st February 2020),Self_Recursive_Data (22nd February 2020),Shelwien (22nd February 2020),Sportman (21st February 2020),xinix (23rd February 2020)

  3. #2
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    903
    Thanks
    84
    Thanked 329 Times in 230 Posts
    Quote Originally Posted by Matt Mahoney View Post
    10 MB RAM allowed.
Is the test machine link right? It shows only 3816 MB. I assume 10 MB RAM must mean 10 GB?

  4. #3
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,027
    Thanks
    622
    Thanked 418 Times in 316 Posts
Yes, it's 10GB.
    "Restrictions: Must run in ≲100 hours using a single CPU core and <10GB RAM and <100GB HDD on our test machine."

  5. #4
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    185
    Thanks
    19
    Thanked 17 Times in 15 Posts
From the LTCB site, I guess the winner is phda9 again.

  6. #5
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    242
    Thanks
    42
    Thanked 99 Times in 51 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Contest starts April 14 2020 with awards thereafter for improvements of 1% or better.
Leonardo da Vinci's birthday! Good choice!

    The front page of LTCB still says "50,000 euros of funding".

    From the FAQ page:

    > Unfortunately the author of phd9 has not released the source code.

Except for the enwik-specific transforms, which reduced the effective size of the input (the DRT-transformed enwik8) by 8.36%.
Besides, the dictionary from phda9 is now being used in cmix (and maybe in some of the latest paq8 derivatives?).
I also shared ideas, but there were no contributions from others, and almost no comments. Seemingly no one believes
that simple things like a well-thought-out reordering of wiki articles can improve the result by a percent or two (-:

    > the first winner, a Russian who always had to cycle 8km to a friend to test his code because he did not even have a suitable computer

    I guess this legend is based on these words:
    "I still don't have access to a PC with 1 Gb or more, have found only 512 Mb computer, but it's about 7 or 8 km
    away from my home (~20 minutes on the bicycle I use), thus I usually test on a 5 Mb stub of enwik8".

Always had to cycle? That's too much of an exaggeration!
I still cycle 3-5 hours per week when staying in Kitchener-Waterloo (9-10 of the previous 12 months),
go to a swimming pool when there's enough free time,
and do lots of pull-ups outdoors when the weather permits, simply because I enjoy these activities.
By the way, even though I was born in Siberia, my ethnicity is almost 100% Ukrainian,
so I guess it would be better to call me "a Ukrainian". My first flight to Canada in May 2006 was from Kiev,
because I lived there then, and because all of my relatives, except my parents, reside in Ukraine.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  7. Thanks (3):

    JamesB (24th February 2020),Matt Mahoney (23rd February 2020),Self_Recursive_Data (22nd February 2020)

  8. #6
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    185
    Thanks
    19
    Thanked 17 Times in 15 Posts
Maybe it would be more interesting to use enwik10 rather than enwik9... and a time limit of <= 48 hours.

  9. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    So, the target compressed size of enwik9 to get the prize is ~115,300,000 (115,506,944 with decoder).

Size-wise, it may be barely reachable for cmix (the v18 result is 115,714,367), but even v18 needs 3x the allowed memory and 2x the time
(we can improve compression using public preprocessing scripts and save some memory and time by discarding non-text models,
but a 3x difference in memory size is too big).
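
A quick sanity check of these numbers (a minimal sketch, using the baseline size L = 116'673'681 from the award formula quoted further down; the ~200 KB decoder allowance is only an illustrative guess):
Code:
    # 1% improvement over the phda9 baseline
    L = 116_673_681
    S_max = L * 99 // 100     # largest total size that still earns an award
    print(S_max)              # 115506944 -> "115,506,944 with decoder"
    print(S_max - 200_000)    # ~115.3 million left for the data, assuming a ~200 KB sfx/decoder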

So once again either Alex wins the first prize, then other people can start doing something (since open source is required),
or the contest keeps being stuck as before.

  10. #8
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,027
    Thanks
    622
    Thanked 418 Times in 316 Posts
paq8pxd now reaches 125-126,000,000 bytes with the -s15 option (32GB), but there is no big difference for the -s11 or -s12 options, which use about 9-10GB. Test time is about 18-32h, depending on preprocessing.

On the other hand, regarding preprocessing: in my tests, splitting the enwik9 file into 4 parts and merging them in "1423" order gives about 100-200 kB of gain. The batch file for resplitting is about 8 KB.
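
A minimal sketch of that split-and-reorder preprocessing (assuming a naive equal four-way split; the actual split points may differ):
Code:
    # Split enwik9 into 4 parts and write them back in "1423" order.
    with open("enwik9", "rb") as f:
        data = f.read()
    n = len(data)
    parts = [data[i * n // 4:(i + 1) * n // 4] for i in range(4)]  # parts 1..4
    with open("enwik9.1423", "wb") as out:
        for i in (0, 3, 1, 2):                                     # order "1423"
            out.write(parts[i])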

  11. #9
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    134
    Thanks
    11
    Thanked 2 Times in 2 Posts
    Matt says "The task is now to compress (instead of decompress) enwik9 to a self extracting archive"

Why the change to measure compression and not decompression? Shouldn't the measured time be the total time to compress+decompress? Strong AI would cycle/recurse through finding and digesting new information, then extracting new insights, and repeat. Kennon's algorithm, for example, compresses very slowly but extracts super fast.

  12. #10
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    903
    Thanks
    84
    Thanked 329 Times in 230 Posts
    Quote Originally Posted by Shelwien View Post
So once again either Alex wins the first prize
In theory anybody can win; there are almost 2 months to create something (the contest starts April 14, 2020). Spread the Hutter Prize message/hostname so more people try something; the prize money is serious enough to spend some time on it.

  13. #11
    Member
    Join Date
    Apr 2018
    Location
    Indonesia
    Posts
    74
    Thanks
    15
    Thanked 5 Times in 5 Posts
    it is nice

  14. #12
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    593
    Thanks
    233
    Thanked 226 Times in 107 Posts
Since it is open source now and the rules allow multiple authors (when they agree on how to divide the prize money), I'd suggest teamwork. This would prevent one author sending in an entry and another then applying his prepared transformations on it, and it has a higher chance of getting a big improvement in ratio. Also, enwik9 is a big enough target for multiple people to test optimizations.

    And it matches the encode.(r/s)u/paq/cmix spirit

    Of course it's a bit risky, too, because someone might steal the ideas and make his own entry out of it, but it's unlikely that he'll win that way as the original source of the ideas is known.
    http://schnaader.info
    Damn kids. They're all alike.

  15. Thanks (2):

    bwt (22nd February 2020),Darek (22nd February 2020)

  16. #13
    Member
    Join Date
    Apr 2018
    Location
    Indonesia
    Posts
    74
    Thanks
    15
    Thanked 5 Times in 5 Posts
    Quote Originally Posted by schnaader View Post
Since it is open source now and the rules allow multiple authors (when they agree on how to divide the prize money), I'd suggest teamwork. This would prevent one author sending in an entry and another then applying his prepared transformations on it, and it has a higher chance of getting a big improvement in ratio. Also, enwik9 is a big enough target for multiple people to test optimizations.

    And it matches the encode.(r/s)u/paq/cmix spirit

    Of course it's a bit risky, too, because someone might steal the ideas and make his own entry out of it, but it's unlikely that he'll win that way as the original source of the ideas is known.
If I am not wrong, besides Byron Knoll there are Márcio Pais and Mauro Vezzosi who have improved the cmix compression ratio.

  17. #14
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    242
    Thanks
    42
    Thanked 99 Times in 51 Posts
    Quote Originally Posted by Shelwien View Post
So once again either Alex wins the first prize, then other people can start doing something (since open source is required), or the contest keeps being stuck as before.
    So as usual you seemingly believe that (1) completely new approaches won't be competitive, and (2) no big improvement is possible within the established framework.

    I'm sure an accelerated version of cmix can win, because

    (1) Many people on this forum are able to accelerate cmix (and neither myself nor phda9 derivatives will compete this year)

    (2) Using the existing model mixing framework it's possible to create a compressor that
    will use 10 GB RAM, and less than 100 hours to compress,
but the decompressor will need either 100+ GB RAM or 1000+ hours to decompress.

Because in the compressor, each model (or set of models)
can independently provide the probability of a 1 for every bit of the input,
and therefore may use all 10 GB of RAM,
and the allowed HDD space encourages 4+ such models/sets:
with a 32-bit floating-point number per bit of input, that's less than
24 GB per model/set, assuming the transformed input is smaller than 0.75 GB.

But in the decompressor you need the probabilities from all the models to decompress every bit of the original data.
I guess this asymmetry was discussed on this forum a few years ago.
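
A back-of-the-envelope check of those figures (the 0.75 GB transformed-input size is the assumption stated above):
Code:
    transformed_gb = 0.75                  # assumed size of the transformed input
    bits = transformed_gb * 2**30 * 8      # one prediction per input bit
    per_model_gb = bits * 4 / 2**30        # 32-bit float per prediction
    print(per_model_gb)                    # 24.0 GB of predictions per model/set
    print(int(100 / per_model_gb))         # 4 such sets fit in the 100 GB HDD allowance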

    As you see in the 1st post,
    "The task is now to compress (instead of decompress) enwik9
    to a self extracting archive in 100 hours" with 10 GB RAM allowed.
    Last edited by Alexander Rhatushnyak; 22nd February 2020 at 19:14. Reason: "with a 32-bit floating point number per bit of input, less than 24 GB per model/set assuming the transformed input is..."

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  18. Thanks (2):

    schnaader (22nd February 2020),Shelwien (22nd February 2020)

  19. #15
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,487
    Thanks
    26
    Thanked 129 Times in 99 Posts
What are the limits for the decompressor then? Correctness has to be verified somehow.

  20. #16
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    243
    Thanks
    112
    Thanked 114 Times in 69 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    decompressor will need either 100+ GB RAM, or 1000+ hours to decompress.
    From my reading of the rules this would not be allowed. "Each program must run on 1 core in less than 100 hours on our test machines with 10GB RAM and 100GB free HD for temporary files. No GPU usage."
    "Each program" here I assume applies to both the compression and decompression program.

  21. #17
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    242
    Thanks
    42
    Thanked 99 Times in 51 Posts
    Quote Originally Posted by Shelwien View Post
    115,506,944 with decoder.
    The number in the yellow table on the front page is 115'518'496.
That is with the size of the compressor included, according to the new rules:
length(comp9.exe/zip)+length(archive9.exe),
and the size of the decompressor is included in length(archive9.exe).

    Quote Originally Posted by byronknoll View Post
    "Each program" here I assume applies to both the compression and decompression program.
    If this is true, then very likely the rules will be adjusted to reflect this.
    Last edited by Alexander Rhatushnyak; 22nd February 2020 at 20:36. Reason: and the size of decompressor is included with length(archive9.exe)

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  22. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    > neither myself nor phda9 derivatives will compete this year

    Well, thanks for this statement - it might motivate some people to actually participate.

    > So as usual you seemingly believe that
    > (1) completely new approaches won't be competitive, and

I have one such "new approach" myself - I still believe that contrepl can be
used for this; it's just a matter of writing and testing a script for it
(which is very time-consuming in the enwik9 case).

But it's still not really profitable - with the 10x prize I'd estimate
it at $20 per hour.
Also note that compressing enwik9 with cmix once on AWS (c5.4xlarge/linux/spot)
would cost something like 0.374*150 = $56... basically, to even make a profit at all
one needs access to lots of free computing resources.

Otherwise, NNCP and durilca are still blocked by the speed and memory limits,
so it's very hard for something totally unrelated to paq to win.

And it's not like enwik9 is a completely new target that nobody has tried to compress before.

    > (2) no big improvement is possible within the established framework.

It's quite possible - for example, automated parameter tuning should have a significant effect.
Also parsing optimization, speculative probability estimation (e.g. we can compute
a byte probability distribution with bitwise models; we just need a way to undo their updates),
context generation and right contexts, etc.
There are lots of ideas, really.
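
A rough sketch of the byte-distribution idea, assuming a bitwise model exposed as a function predict_bit(bits) that returns P(next bit = 1) given the bits seen so far (a hypothetical stand-in, not an actual paq/cmix interface):
Code:
    def byte_distribution(predict_bit):
        # Walk the 8-level binary tree of partial bytes, multiplying bit
        # probabilities along each path, without committing any model updates.
        probs = {(): 1.0}
        for _ in range(8):
            nxt = {}
            for bits, p in probs.items():
                p1 = predict_bit(bits)            # P(next bit = 1 | bits)
                nxt[bits + (0,)] = p * (1.0 - p1)
                nxt[bits + (1,)] = p * p1
            probs = nxt
        return probs                              # 256 bit-tuples, summing to ~1.0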

    But paq requires a lot of rather dumb refactoring work to become compatible with new features.
    And the developer would need a lot of computing resources for testing it.

    And I don't think that it would be fair for eg. me to tweak cmix and get a prize -
    while cmix actually has an author who invested a lot of time in its development.

    > I'm sure an accelerated version of cmix can win, because

    I also think that it can win at least the first time, if you don't participate.

It's more about who has the computing resources to tweak its memory usage.

    > But in the decompressor you need probabilities from all the models
    > for decompressing every bit of the original data.

    Well, your original idea doesn't seem to be compatible with the rules,
    but it may be possible to use the "manual swapping" idea which
    I thought your phda used.

In any case, to even start anything, we first have to get the compression ratio and speed
within the limits.

  23. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    > The number in the yellow table on the front page is 115'518'496.

    Award = 500000*(L-S)/L, where
    S = new record (size of comp9.exe+archive9.exe+opt or alternative above),
    L = 116'673'681
    Minimum award is 1% of 500000.

    500000*(1-115506944/116673681) = 5000.00081
    500000*(1-115518496/116673681) = 4950.49522

    I guess Matt has a buggy calculator.
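
For reference, the two figures above recomputed with the quoted formula (a trivial check, nothing more):
Code:
    L = 116_673_681
    award = lambda S: 500000 * (L - S) / L
    print(award(115_506_944))   # 5000.0008... -> just over the 1% minimum
    print(award(115_518_496))   # 4950.4952... -> below the 5000 euro minimum award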

    > With the size of compressor, according to the new rules:
    > length(comp9.exe/zip)+length(archive9.exe)

Yes, but it's dumb and I hope it will be changed to only include the SFX size, like before.
Now it counts the decoder twice for no reason.

But if it's not changed, I guess the actual target is 115,100,000.

    > If this is true, then very likely the rules will be adjusted to reflect this.

I don't think it's a good idea to encourage this trick.

Also I suspect that Matt simply won't be able to test entries
with 100GB memory usage and 100+1000 hours runtime (that's 45 days).

  24. Thanks:

    CompressMaster (22nd February 2020)

  25. #20
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    184
    Thanks
    49
    Thanked 13 Times in 13 Posts
So, enwik8 is now without any prize, like enwik9 was before?
Oh, it's *good* that I haven't developed my custom compressor targeted at enwik8 so far... but I'm working on it!

  26. #21
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    593
    Thanks
    233
    Thanked 226 Times in 107 Posts
    Quote Originally Posted by Shelwien View Post
    Also note that compressing enwik9 with cmix once on AWS (c5.4xlarge/linux/spot)
    would cost something like 0.374*150=$56... basically to even make a profit at all
    one needs to have access to lots of free computing resources.
    Hetzner has better pricing, 0.006 €/h or 36 €/month (CX51, 32 GB RAM, 240 GB disk). There's a "dedicated vCPU" alternative for 83 €/month, but I couldn't see a big performance difference last time I tried.

    Apart from that, Byron's offer for Google credit might still be available.
    http://schnaader.info
    Damn kids. They're all alike.

  27. Thanks:

    Shelwien (22nd February 2020)

  28. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
I actually can run 3-4 instances of cmix at home. I just wanted to point out that running one costs considerable money;
it'd be $5 per run on my home PC (which is probably much faster than a Hetzner VM) just in electricity bills.

It's just that some optimizations, like what Alex suggests (optimizing article reordering etc.), would really benefit
from having access to 100s of free instances.

Anyway, Intel DevCloud would likely be a better choice atm, since they provide a few months of free trial...
but I think they don't allow a single task to run for more than 24 hours, so it would be necessary to implement
some kind of save/load feature.
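
A minimal checkpoint/resume sketch of such a save/load feature (the block-wise processing and state layout are purely illustrative, not how cmix is organized):
Code:
    import os, pickle, time

    CHECKPOINT = "run.ckpt"
    TIME_BUDGET = 23 * 3600                # stop safely before the 24-hour limit

    def run(blocks, process_block, state=None):
        start, pos = time.time(), 0
        if os.path.exists(CHECKPOINT):     # resume from a previous job slot
            with open(CHECKPOINT, "rb") as f:
                pos, state = pickle.load(f)
        while pos < len(blocks):
            state = process_block(blocks[pos], state)
            pos += 1
            if time.time() - start > TIME_BUDGET:
                with open(CHECKPOINT, "wb") as f:
                    pickle.dump((pos, state), f)
                return None                # resubmit the task to continue later
        return state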

    Btw https://encode.su/threads/3242-googl...ession-testing

  29. #23
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    184
    Thanks
    49
    Thanked 13 Times in 13 Posts
    @Shelwien,
could I ask you to upload enwik9 compressed, together with the decompressor (mcm 0.84 with options -x3 and -x10), as you did with enwik10 on the MEGA cloud?
As always, I'm stuck with low HDD space...
    Thank you very much.

  30. #24
    Member
    Join Date
    Jan 2020
    Location
    Canada
    Posts
    134
    Thanks
    11
    Thanked 2 Times in 2 Posts
Can someone answer my post #9 above?

  31. #25
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    903
    Thanks
    84
    Thanked 329 Times in 230 Posts
    What about building your own PC:

    Code:
    Case: Cooler Master MasterBox Q300L (micro-ATX),        45 euro
    Power supply: Seasonic Focus Gold 450W (80 Plus Gold),  60 euro
    Motherboard: Gigabyte B450M DS3H (micro-ATX),           70 euro
    CPU: AMD Ryzen 7 1700 + cooler (8 cores, 3.0-3.7GHz),  130 euro
    Memory: Crucial Ballistix Sport LT 32GB (2666MHz),     115 euro
    Storage: Adata XPG SX6000 Lite 512GB (NVMe M.2),        70 euro
    
    Total:,                                                490 euro
    
    Costs excl. energy 1 year:,                            41 euro p/m 
    Costs excl. energy 2 years:,                           20 euro p/m
    Costs excl. energy 3 years:,                           14 euro p/m
Grabbed the parts; did not check if they all match.

  32. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    @Self_Recursive_Data:
    > Matt says "The task is now to compress (instead of decompress) enwik9 to a self extracting archive"
    > Why the change to compress it and not decompress?

I think it's an attempt to buy the phda sources from Alex (and then to prevent anybody else from monopolizing the contest),
since they require the compressor's sources now.

It was a decompressor before, but with only the decompressor source it might still be hard to reproduce
the result if the algorithm is asymmetric (i.e. the compressor does some data optimization, then encodes the results),
which might be the case for phda.

Also it looks to me that some zpaq ideas affected the new rules:
I think Matt expects the compressor to generate a custom decompressor based on file analysis;
that's probably why both the compressor and decompressor sizes are counted as part of the result.

Not sure why he decided not to support the more common case where enc/dec are symmetric,
maybe it's an attempt to promote asymmetry?

    > Shouldn't the time measured measure the total time to compress+decompress?

    The time doesn't affect the results.
    Based on "Each program must run on 1 core in less than 100 hours"
    we can say that "compress+decompress" is allowed 200 hours.

    > Strong AI would cycle/recurse through finding and digesting new information,
    > then extracting new insights, repeat.
    > Kennon's algorithm for example compresses very slowly but extracts super fast.

    Sure, but 100 hours is more than 4 days.
    It should be enough time to do multiple passes or whatever, if necessary.

    Matt doesn't have a dedicated server farm for testing contest entries,
    so they can't really run for too long.

  33. Thanks:

    Self_Recursive_Data (23rd February 2020)

  34. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    > What about building your own PC

    Keep in mind that a single run takes 5-7 days... I suppose it would be better to install 256GB of RAM (so 490+345=835?),
    then run 8 instances at once.

Actually, for some tasks, like submodel memory usage tweaking or submodel contribution estimation, it would be better
to run individual submodels, write their predictions to files, then do the final mix/SSE pass separately - that way
only the results of a modified submodel have to be recomputed.
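
A sketch of that workflow, with model.predict_all() and mix() as hypothetical stand-ins rather than real paq/cmix interfaces:
Code:
    import numpy as np

    def dump_predictions(model, data, path):
        # one float32 probability per input bit, written once per submodel
        np.asarray(model.predict_all(data), dtype=np.float32).tofile(path)

    def offline_mix(paths, mix):
        # reload every submodel's predictions and rerun only the mix/SSE stage;
        # after changing one submodel, only its prediction file is regenerated
        preds = [np.fromfile(p, dtype=np.float32) for p in paths]
        return mix(np.stack(preds))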

    But unfortunately many other tasks - like testing new preprocessing ideas, or article reordering, or WRT dictionary optimization -
    would still require complete runs.

Btw, I actually did experiment with article reordering for enwik8... but I tried to speed-optimize it by compressing the reordered files with ppmd instead of paq.
Unfortunately, after actual testing it turned out that an article order that improves ppmd compression hurts it for paq.

  35. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    @CompressMaster:
    I don't provide that kind of service.
    You can see mcm results here: http://mattmahoney.net/dc/text.html#1449
    and download enwik9.pmd here: http://mattmahoney.net/dc/textdata.html

  36. #29
    Member
    Join Date
    Apr 2018
    Location
    Indonesia
    Posts
    74
    Thanks
    15
    Thanked 5 Times in 5 Posts
Looking at the LTCB site, cmix v17 beats phda9, but it uses 25GB RAM and more time. How about reducing its settings? Would it still be better than phda9?

  37. #30
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    903
    Thanks
    84
    Thanked 329 Times in 230 Posts
    Quote Originally Posted by Shelwien View Post
    >I suppose it would be better to install 256GB of RAM (so 490+345=835?), then run 8 instances at once.
This CPU and motherboard support only 64GB max, and the CPU has 8 cores (16 threads), so in theory you can run 16 instances with 4GB memory each (with 64GB installed).

    For 64GB:
    Memory: Crucial Ballistix Sport LT 64GB (2666MHz) 285 euro.

    Total: 660 euro.

    For 128GB:
    Motherboard: ASRock X570M Pro4 (micro-ATX) 195 euro.
    CPU: AMD Ryzen 9 3900X (12 cores, 3.8-4.6GHz) 470 euro.
Memory: HyperX Fury black 128GB (3200MHz) 675 euro.

    Total: 1515 euro.


