
Thread: Hutter Prize, 4.17% improvement is here

  1. #61
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    "Also, you can provide your own dictionary ..." -- that was a 3-minute answer, and here's a 30-minute:

    Why would I need Someone Else's code if I were young and smart and looking for a problem to solve, a challenge to take?
    Doesn't the fact that fast and small and simple SSE is still pretty useful (after all the enormous work performed by cmix and similar)
    tell you that PAQ-like algorithms (including phda and cmix) don't recognize any patterns in the
    model_number / probability_from_previous_models two-dimensional matrix?
    They only have a one-dimensional view of it, first horizontal, then vertical.
    But your algorithm does not have to look at this matrix at all !
    Come on, guys, forget about phda and cmix, invent something radically new!
    Like BWT and ANS were amazingly new solutions to old and boring problems.
    And for super-brief introductions to a couple problems that look really important,
    search for my last name on youtube.com (2nd half of the talk)
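
    [For readers who haven't seen one: below is a rough sketch of the kind of SSE stage being referred to -- a secondary estimation table indexed by (context, quantized probability from the earlier models), refined by interpolation and nudged toward each observed bit. This is only an illustration in the spirit of PAQ-style coders, not code from phda or cmix; the class name, the 33-bucket layout, 12-bit probabilities and the learning rate are illustrative assumptions, and the stretch-to-logistic-domain step that paq applies before bucketing is skipped for brevity.]

    #include <cstdint>
    #include <vector>

    class APM {                                 // secondary symbol estimation sketch
      std::vector<uint16_t> t;                  // t[ctx*33 + bucket] = refined probability
      int index = 0;                            // last cell touched, reused by update()
    public:
      explicit APM(int nCtx) : t((size_t)nCtx * 33) {
        for (int c = 0; c < nCtx; ++c)
          for (int j = 0; j < 33; ++j)
            t[(size_t)c * 33 + j] = (uint16_t)(j * 65535 / 32);  // start as identity mapping
      }
      // p: 12-bit probability (0..4095) from the earlier models, ctx: caller-chosen context
      int refine(int p, int ctx) {
        int w = p & 127;                        // interpolation weight between two buckets
        index = ctx * 33 + (p >> 7);
        return (t[index] * (128 - w) + t[index + 1] * w) >> 11;  // refined 12-bit probability
      }
      void update(int bit, int rate = 7) {      // pull both cells toward the observed bit
        int g = (bit << 16) + (bit << rate) - bit - bit;
        t[index]     = (uint16_t)(t[index]     + ((g - t[index])     >> rate));
        t[index + 1] = (uint16_t)(t[index + 1] + ((g - t[index + 1]) >> rate));
      }
    };

    [The point being criticized above: mixing first collapses the models' probabilities into one number, and SSE then refines that single number per context -- two one-dimensional passes, neither of which looks at the full model-by-probability matrix at once.]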

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  2. Thanks:

    image28 (26th February 2020)

  3. #62
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    245
    Thanks
    112
    Thanked 115 Times in 70 Posts
    Here are the results for version 1.6:


    enwik9 compressed size: 117039346 bytes
    size of decompression program in .zip: 41911 bytes
    total size (compressed file + decompression program): 117081257 bytes
    compression time: 84713.758 seconds
    decompression time: 88401.702 seconds
    compression memory: 4996672 KiB
    decompression memory: 4993904 KiB


    enwik8 compressed size: 15040647 bytes
    size of decompression program in .zip: 564616 bytes
    total size (compressed file + decompression program): 15605263 bytes
    compression time: 8946.996 seconds
    decompression time: 8904.394 seconds
    compression memory: 3799820 KiB
    decompression memory: 3836272 KiB


    Description of test machine:
    processor: Intel Core i7-7700K
    memory: 32GB DDR4
    OS: Ubuntu 16.04

  4. Thanks:

    Alexander Rhatushnyak (24th October 2018)

  5. #63
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    The set of enwik-specific transforms
    from the 2017 Hutter Prize winning entry
    goes open-source today.
    Archive is attached, read_me is included.
    Attached Files

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  6. Thanks (9):

    byronknoll (3rd January 2019),comp1 (3rd January 2019),Cyan (3rd January 2019),Darek (3rd January 2019),encode (3rd January 2019),Jyrki Alakuijala (3rd January 2019),Mike (3rd January 2019),milky (6th January 2019),xinix (3rd January 2019)

  7. #64
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    504
    Thanks
    184
    Thanked 177 Times in 120 Posts
    What size difference do these transforms make to phda? I'm curious to know how well it does as a generic text compressor vs dedicated to this one corpus.

    I guess the flip-side of this is to test what these transforms do to the performance of e.g. cmix. It's already matching phda9 for size, albeit considerably off for CPU and memory. It'd take a long while to know though!

    Anyway, thanks for making them public.

  8. #65
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    Quote Originally Posted by byronknoll View Post
    Here are the results for version 1.6:
    ...
    enwik8 compressed size: 15040647 bytes
    If you compress preprocessed_enwik8 instead of enwik8: 15015895 bytes.
    The difference is much bigger in the compressor that may use only 1 GiB of RAM.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  9. Thanks:

    JamesB (4th January 2019)

  10. #66
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    245
    Thanks
    112
    Thanked 115 Times in 70 Posts
    @Alex, thanks for posting these transforms!

    Quote Originally Posted by JamesB View Post
    I guess the flip-side of this is to test what these transforms do to the performance of e.g. cmix. It's already matching phda9 for size, albeit considerably off for CPU and memory. It'd take a long while to know though!
    cmix currently compresses enwik8 to 14965334.
    preprocessed_enwik8 compresses to 14953755.

  11. #67
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    36445248 enwik8.gz --- 103.44%
    35233985 preprocessed_enwik8.gz

    29008758 enwik8.bz2 --- 101.78%
    28502591 preprocessed_enwik8.bz2

    24861205 enwik8.7z --- 101.41%
    24515695 preprocessed_enwik8.7z

    gzip 1.6 -9
    bzip2 1.0.6 -9
    7z 9.20 -t7z -mx=9


    Update:
    Also, with phda9 the size of enwik8 before DRT decreases by ~5.1% (as in the read_me),
    but size after DRT decreases by ~8.36% : 60520510 ==> 55852090 bytes.
    Last edited by Alexander Rhatushnyak; 7th January 2019 at 04:36. Reason: Update

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  12. #68
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    245
    Thanks
    112
    Thanked 115 Times in 70 Posts
    Using the instructions in enwik8preproc.zip, I extracted the two files from the phda November 2017 release:
    1) preprocessed enwik8
    2) phda dictionary

    I tried testing these with cmix:
    cmix currently compresses enwik8 to 14947555 bytes.
    phda preprocessed enwik8 compresses to 14866787 bytes.
    cmix using the phda dictionary compresses enwik8 to 14878492 bytes.

    The dictionary used by phda is amazing! @Alex, I am hoping to use this dictionary for cmix - let me know if this is not OK. Recently I have been focusing on the DRT, including trying to improve the dictionary. I am learning a lot through trying different experiments, but I think I am a long way from being able to generate a dictionary even close to this level of performance.

  13. #69
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    Quote Originally Posted by byronknoll View Post
    I am hoping to use this dictionary for cmix - let me know if this is not OK.
    OK, but if you tried building a dictionary by yourself, that would be even better

    Quote Originally Posted by byronknoll View Post
    phda preprocessed enwik8 compresses to 14866787 bytes.
    cmix using the phda dictionary compresses enwik8 to 14878492 bytes.
    So the gain from enwik-specific preprocessing is 11705 bytes,
    very close to what was observed earlier:
    cmix currently compresses enwik8 to 14965334.
    preprocessed_enwik8 compresses to 14953755.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  14. Thanks:

    byronknoll (23rd January 2019)

  15. #70
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    245
    Thanks
    112
    Thanked 115 Times in 70 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    OK, but if you tried building a dictionary by yourself, that would be even better
    Thanks! Yeah, I will definitely continue trying to build a dictionary and improve DRT.

  16. #71
    Member
    Join Date
    May 2008
    Location
    France
    Posts
    83
    Thanks
    528
    Thanked 27 Times in 19 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    Version 1.6 is here: http://qlic.altervista.org/phda9.zip
    As usual, four executables are barely tested,

    hopefully they will work as expected.

    Compressed size of enwik9 should be 117'039'xxx.

    UPDATE: 117'039'346.
    New version from Alexander (same link)!

  17. #72
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    Version 1.7 is here: http://qlic.altervista.org/phda9.zip
    Improvement is below 0.1% on enwik9,
    but bigger on smaller text files,
    e.g. 0.4% on book1 from CC.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  18. Thanks (3):

    byronknoll (22nd February 2019),Jyrki Alakuijala (19th February 2019),Matt Mahoney (22nd February 2019)

  19. #73
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    245
    Thanks
    112
    Thanked 115 Times in 70 Posts
    Here are the results for version 1.7:


    enwik9 compressed size: 116940874 bytes
    size of decompression program in .zip: 43,274 bytes
    total size (compressed file + decompression program): 116984148 bytes
    compression time: 83712.733 seconds
    decompression time: 87596.519 seconds
    compression memory: 4999504 KiB
    decompression memory: 4996880 KiB


    enwik8 compressed size: 15023870 bytes
    size of decompression program in .zip: 565,352 bytes
    total size (compressed file + decompression program): 15589222 bytes
    compression time: 8907.486 seconds
    decompression time: 8868.092 seconds
    compression memory: 3802892 KiB
    decompression memory: 3838836 KiB


    Description of test machine:
    processor: Intel Core i7-7700K
    memory: 32GB DDR4
    OS: Ubuntu 18.04

  20. Thanks (2):

    Alexander Rhatushnyak (24th February 2019),Matt Mahoney (22nd February 2019)

  21. #74
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 796 Times in 488 Posts

  22. Thanks (5):

    Alexander Rhatushnyak (24th February 2019),avitar (24th February 2019),encode (22nd February 2019),Mike (22nd February 2019),schnaader (22nd February 2019)

  23. #75
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    836
    Thanks
    239
    Thanked 307 Times in 183 Posts
    Quote Originally Posted by Matt Mahoney View Post
    How do we get the large window brotli results on LTCB?

    These results have been available since 2016. https://groups.google.com/forum/m/#!...li/aq9f-x_fSY4

    LTCB reports 223'597'884 bytes

    Actually delivered with large window: 199'118'013 bytes

  24. #76
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    Version 1.8 is here: http://qlic.altervista.org/phda9.zip
    Memory usage is higher now, 6 GiB rather than 4.75.
    No other news.
    As usual, barely tested executables.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  25. Thanks (2):

    byronknoll (5th July 2019),Mike (5th July 2019)

  26. #77
    Member
    Join Date
    Mar 2011
    Location
    USA
    Posts
    245
    Thanks
    112
    Thanked 115 Times in 70 Posts
    Here are the results for version 1.8:


    enwik9 compressed size: 116544849 bytes
    size of decompression program in .zip: 42,944 bytes
    total size (compressed file + decompression program): 116587793 bytes
    compression time: 86182.993 seconds
    decompression time: 86305.520 seconds
    compression memory: 6319256 KiB
    decompression memory: 6316396 KiB


    enwik8 compressed size: 15010414 bytes
    size of decompression program in .zip: 558,298 bytes
    total size (compressed file + decompression program): 15568712 bytes
    compression time: 9162.225 seconds
    decompression time: 9258.572 seconds
    compression memory: 4800208 KiB
    decompression memory: 4836472 KiB


    Description of test machine:
    processor: Intel Core i7-7700K
    memory: 32GB DDR4
    OS: Ubuntu 18.04

  27. Thanks (3):

    Alexander Rhatushnyak (11th July 2019),Darek (9th July 2019),Matt Mahoney (9th July 2019)

  28. #78
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 796 Times in 488 Posts

  29. Thanks (3):

    Alexander Rhatushnyak (11th July 2019),Mike (10th July 2019),rarkyan (11th July 2019)

  30. #79
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    I am going to publish an article
    presenting this idea.
    Most likely on arxiv.org.
    Looking for a co-author or two.
    I believe what's already in the draft
    is worth at least 1/3 of this article.

    If you want to be a co-author, it's
    up to you how much you contribute,
    maybe as little as one sentence.
    A graph based on empirical data
    would clearly look better,
    but requires some work.
    No problem if you can't provide it,
    please do what you can. Thank you!
    Last edited by Alexander Rhatushnyak; 5th September 2019 at 16:07.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  31. Thanks (2):

    Hakan Abbas (2nd September 2019),Mike (2nd September 2019)

  32. #80
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    Same draft as a Google Doc: here

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  33. Thanks:

    Mike (8th September 2019)

  34. #81
    Member
    Join Date
    May 2008
    Location
    France
    Posts
    83
    Thanks
    528
    Thanked 27 Times in 19 Posts
    New rule published on 2019-12-03 (and cosmetic changes on 2020-01-15):
    Documented source code must be made publicly available under some OSI-approved license before the prize money will be paid out.

  35. Thanks:

    xinix (16th January 2020)

  36. #82
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,915
    Thanks
    291
    Thanked 1,273 Times in 720 Posts
    I guess we need to work hard and post enough improvements for paq, so that Alex could make enough money to get bored of it and open-source phda.

    Also, the recent entry uses 5G of memory (with swap to disk), and an archive for it basically has to be encoded on the same machine (because of floats):
    http://mattmahoney.net/dc/text.html#1165
    So it's only possible to compete using the same tricks (it's fairly similar to a 5x difference in window size).

    I wonder why they can't do the sane thing and just reboot the contest with different sample data...
    say, the same 100M of enwik text, but with all markup, bibliography and references filtered out.
    And with a higher memory limit, since 1G for a 100M file means that only paq is applicable
    (because it basically stores statistics in compressed format (hashtable/FSM)).
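
    [To unpack that last parenthesis: in paq-style coders the per-context statistics fit in a single byte per hashed context slot, so a 1G budget covers on the order of a billion contexts. A minimal sketch of the idea follows -- illustrative only, not code from paq, cmix or phda; packing two 4-bit saturating counts per byte is just one simple stand-in for paq's actual bit-history FSM, and the names and sizes are made up.]

    #include <cstdint>
    #include <vector>

    class HashedCounts {
      std::vector<uint8_t> t;                            // one byte of statistics per slot
    public:
      explicit HashedCounts(size_t size) : t(size, 0) {} // size should be a power of two

      // 12-bit probability that the next bit is 1, for context hash h
      int predict(uint64_t h) const {
        uint8_t s = t[h & (t.size() - 1)];
        int n0 = s >> 4, n1 = s & 15;                    // two 4-bit saturating counts
        return ((2 * n1 + 1) * 4096) / (2 * (n0 + n1) + 2);
      }

      void update(uint64_t h, int bit) {
        uint8_t& s = t[h & (t.size() - 1)];
        int n0 = s >> 4, n1 = s & 15;
        if (bit) ++n1; else ++n0;
        if (n0 > 15 || n1 > 15) { n0 >>= 1; n1 >>= 1; }  // halve both counts on overflow
        s = (uint8_t)((n0 << 4) | n1);
      }
    };

    [Real paq/cmix drive a byte-sized bit-history state through a precomputed FSM plus an adaptive probability per state, but the space argument is the same: one byte of state per context instead of floats or full counter tables.]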

  37. Thanks:

    xinix (16th January 2020)

  38. #83
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    Quote Originally Posted by Shelwien View Post
    Also the recent entry uses 5G of memory (with swap to disk)
    Sorry, do you mean the latest HP entry, or the latest phda9?
    The latest HP entry uses less than 1 GiB of RAM, and a 176 MiB scratch file.
    The latest phda9 uses ~6.02 GiB and nothing like "swap to disk".

    Quote Originally Posted by Shelwien View Post
    and an archive for it basically has to be encoded on the same machine (because of floats): http://mattmahoney.net/dc/text.html#1165
    Really? Have you ever tried?

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  39. #84
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,915
    Thanks
    291
    Thanked 1,273 Times in 720 Posts
    > Sorry do you mean the latest HP entry? or the latest phda9 ?

    latest HP entry.

    > The latest HP entry uses less than 1 GiB of RAM, and a 176 MiB scratch file.

    5G is reported on LTCB.
    Or is it asymmetric, and the encoder uses 5G while the decoder fits in 1G?
    Then it's my mistake.
    I presumed that it would fit under the time constraint even when working
    with swap on 1G of physical memory.

    > Really? Have you ever tried?

    Tried this now - it didn't run because of a too-old glibc.
    I'll try again tomorrow to see whether it works after patching the glibc version in the unpacked decoder.

    Also, I remember some archive which still had a Windows decoder, but it didn't decode.

    And I doubt that you would be able to compile that decoder from source now and get it to decode the entry archive.
    It's not like cmix etc. don't have the same problem, but in this case it makes contest entries hard to verify,
    since math functions (log etc.) would likely work differently in different glibc versions and/or on different Linux distributions.
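
    [To make the verification concern concrete, here is a hedged illustration (not phda code): libm's exp/log are not required to be correctly rounded, so two builds can return a probability differing in its last bit, and an arithmetic coder's interval split depends on that exact value, so encoder and decoder must agree bit-for-bit. The weights and interval values below are arbitrary.]

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main() {
      double w = 0.73, s = -1.8125;                  // some mixer weight and stretched input
      double p = 1.0 / (1.0 + std::exp(-w * s));     // logistic squash via libm

      // The coder splits the current interval at lo + range * p.
      uint32_t lo = 0x40000000u, range = 0x20000000u;
      uint32_t split = lo + (uint32_t)(range * p);
      std::printf("p = %.17g  split = %08x\n", p, split);
      // If another libm rounds exp() one ulp differently, 'split' can differ by one,
      // the coder state diverges from there, and the decoder can no longer follow
      // unless it reproduces the encoder's floating-point results exactly.
      return 0;
    }

    [Integer or table-driven squash functions with hard-coded constants sidestep this, at the cost of some precision.]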

    -------------

    Anyway, I presume that you don't want to open-source phda.
    It may be your right (or not, since paq and cmix are GPL), and in any case nobody can force you.
    But can you please tell that to the contest organizers, so that they can reboot the contest and let others participate?
    Because competing with you in the current contest (if it's even possible) requires everybody to stop posting any CM improvements for years.

  40. Thanks:

    xinix (18th January 2020)

  41. #85
    Member
    Join Date
    Feb 2020
    Location
    New Zealand
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Early in the morning down under, thought I would quickly introduce myself.

    As in my bio, I dug up some old code from my 20s yesterday after finding this site.
    Still early stages yet. I wrote a quick bash test that compresses the base enwik9(*) 1 GB file down to ~500 MB, using mostly grep regexes to generate dictionaries, while I sift through 20 years of backups.

    (*) updated: I had enwik8 here but was talking about enwik9.
    Had a look at improving the regexes a minute ago (separating the XML tags may yield better results),
    e.g.
    cat enwik9 | grep -iEo "(<|<\/)[A-Z0-9]*>" | sort | uniq > enwik9-tags
    cat enwik9 | grep -iEv "(<|<\/)[A-Z0-9]*>" | grep -iEo "[A-Z0-9]* " | sort | uniq > enwik9-words
    Forgot the regex I used to extract all other symbols :T


    One of my best algos may be ready to go if I can find it. I only ever tested it on already-compressed data (mp3/xvid files), but it could out-compress tar/gz/bz2/rar/zip in those test cases.
    (Once ran it for 5 days of iterations before it was compressing less than 8 bits per iteration. Decompression was much faster.)

    Will try to find something to post each week over the coming weeks if y'all are keen!

    Anyway, hello, and happy Wednesday/Thursday depending on where you are,
    Kevin
    Last edited by image28; 26th February 2020 at 16:47.

  42. #86
    Member
    Join Date
    Feb 2020
    Location
    New Zealand
    Posts
    2
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Friday here. Running my first test on enwik9, and going out to enjoy the afternoon while it runs. Had to slow the algo down to fit the RAM requirements. The algorithm I'm working on still has a few issues. The final version should be around 100 lines of vanilla C code (no includes, base Linux gcc), if I don't decide to rewrite it in asm. It compresses the file in iterations, each iteration compressing the file to 3/4 of the previous iteration's size. Ran into some issues with fseek not liking my use of u_int32_t offsets. Worked around it by repeatedly re-reading the file up to the position (a quick hack to make sure it could compress and decompress a test file, slowing the program right down in the process). Currently uses around 5 GB of RAM. Will update with the time it takes tomorrow.
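
    [On the fseek issue, a hedged guess at what was going on, illustrative only: fseek() takes a signed long offset, so a backward step computed in an unsigned 32-bit type wraps to a huge positive value on a 64-bit Linux build instead of going negative. Keeping offsets signed (or using fseeko/off_t) avoids the re-read-from-the-start workaround. The file name and numbers below are just placeholders.]

    #include <cstdio>
    #include <cstdint>

    int main() {
      // offsets held in unsigned 32-bit, as described above
      uint32_t pos = 100, back = 200;

      long wrapped = (long)(pos - back);       // 4294967196 on 64-bit Linux, not -100
      long correct = (long)pos - (long)back;   // -100, as intended
      std::printf("wrapped=%ld correct=%ld\n", wrapped, correct);

      FILE* f = std::fopen("enwik9", "rb");
      if (f) {
        std::fseek(f, 1000, SEEK_SET);         // move somewhere first
        std::fseek(f, correct, SEEK_CUR);      // then step back 100 bytes directly
        std::fclose(f);
      }
      return 0;
    }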

    Happy Friday!

  43. #87
    Member
    Join Date
    Feb 2020
    Location
    here
    Posts
    15
    Thanks
    5
    Thanked 0 Times in 0 Posts


    How do you do, people!
    The unpacked file in the attachment causes a segfault in the latest phda9, both with preprocessing and without, at 1%!
    Thanks for your attention!
    P.S. Who will say that I'm a useless person? I'm a talented tester
    https://www.youtube.com/watch?v=0N9PCPW2fpY
    Attached Files

  44. #88
    Member
    Join Date
    Feb 2020
    Location
    here
    Posts
    15
    Thanks
    5
    Thanked 0 Times in 0 Posts
    I have been testing phda9:
    $./phda9 C9 enwik9 out
    Is this right? Is C9 the right option for enwik9? As you can see below, phda9 is not a universal compressor, so I ask for the right usage instructions!
    $./phda9 C enwik9 out
    91% Segmentation fault
    $
    So C9 is strictly for enwik9?
    Last edited by well; 25th March 2020 at 12:28.

  45. #89
    Member
    Join Date
    Apr 2020
    Location
    U.S.
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Same problem here as above.
    When I use 'C' on other large enwik text files, it returns a segmentation fault at a different percentage; using 'C' instead of 'C9' on enwik9 also faults at 91%.

  46. #90
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    248
    Thanks
    45
    Thanked 103 Times in 53 Posts
    C9 is only for enwik9
    C is for other files, but they can't be as big as enwik9.

    There's something in readme.txt about sizes.

    Sorry, it looks like this year I won't have
    more than 5 minutes per month for this.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256


