View Poll Results: Public or Private? (16 voters)

  • Public + decompressor size: 7 votes (43.75%)
  • Private: 9 votes (56.25%)

Thread: Compression contest: public or private dataset

  #1 Shelwien (Administrator)

    English plaintext compression task.

    1. Public dataset:
    - we have to add decompressor size to compressed size;
    - encourages compiler tweaks, discourages speed optimization via inlining/unrolling (lzma.exe.7z: 51,165; zstd.exe.7z: 383,490)
    - encourages overtuning (participants can tune their entries to specific data);
    - embedded dictionaries are essentially blocked;

    2. Private dataset:
    - decompressor size has to be limited for technical reasons (<16MB?)
    - embedded dictionaries can be used fairly
    - embedded dictionaries actually improve compression results (see brotli vs zstd benchmarks)
    - we can post hashes of private files in advance, then post the files after end of contest

    It's for an actual contest that is being prepared. Please vote.
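
    To make the difference concrete, here is a minimal sketch of how the two scoring rules could be computed. The entry names and compressed sizes are invented for illustration; only the two smaller decompressor sizes are the lzma.exe.7z / zstd.exe.7z figures quoted above, and the 16 MB cap is the tentative figure from option 2.

```python
# Hedged sketch of the two ranking rules discussed above.
# Entry names and compressed sizes are hypothetical; the two smaller decompressor
# sizes are the lzma.exe.7z / zstd.exe.7z figures quoted in the post.

DECOMP_LIMIT = 16 * 1024 * 1024  # option 2: tentative ~16 MB decompressor cap

entries = [
    # (name, compressed_size_in_bytes, packed_decompressor_size_in_bytes)
    ("lzma-like entry",        260_000_000,    51_165),
    ("zstd-like entry",        300_000_000,   383_490),
    ("dictionary-heavy entry", 255_000_000, 9_000_000),
]

def score_public(compressed: int, decomp: int) -> int:
    """Option 1: decompressor size is added to the compressed size."""
    return compressed + decomp

def score_private(compressed: int, decomp: int):
    """Option 2: only compressed size counts, but the decompressor must fit the cap."""
    return compressed if decomp <= DECOMP_LIMIT else None  # None = disqualified

for name, c, d in entries:
    print(name, "| option 1:", score_public(c, d), "| option 2:", score_private(c, d))
```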

  #2 Alexander Rhatushnyak (Member)

    The question is not only about the English Texts set. There will be a few other test sets.

    Option 1:
    -- Test data are 100% available, participants download the test data just once, and don't have to worry about building a testing set.
    -- It's up to participants to decide what data, if any, to embed into their executables.
    -- Compressed size of decompressor is added to the compressed size of the test set.

    Option 2:
    -- Test data used for ranking are unavailable to participants; they have to build their own testing sets (organizers will provide some samples).
    -- There's a limit on decompressor size (most likely ~16 MB), and if your decompressor is much smaller than that, you get no advantage from it.
    -- Encourages pushing dozens of megabytes of data into executables.



  #3 Member (Greece)

    A private dataset will allow large machine learning models like GPT-2, so this would become a machine learning contest.

    Only time would be a limiting factor for very large NN models.

    If time is a limit, maybe it will favor large statistical models instead of NNs.

    But I am not an expert on GPT-2-like stuff; other people will know better about it.

  #4 Shelwien (Administrator)

    There'd be both a time limit and a decompressor size limit, so no GPT-2.
    The 2nd option just doesn't let decompressor size directly affect the ranking.

  #5 CompressMaster (Member)

    Why public dataset? Well...
    Quote Originally Posted by Alexander Rhatushnyak:
    -- Test data are 100% available, participants download the test data just once, and don't have to worry about building a testing set.
    -- It's up to participants to decide what data, if any, to embed into their executables.
    -- Compressed size of decompressor is added to the compressed size of the test set.

  #6 Shelwien (Administrator)

    I agree that a public dataset is more convenient.
    But it automatically requires adding decoder size to the result, to counter putting part of the data into the decoder.
    However, decoder size can take a considerable share of the result - easily ~0.1% for zstd - which can affect the ranking.

    So decoder size optimization would become part of the competition - splitting decoder and encoder,
    removing unnecessary compiler startup code/libs, testing different compilers and their options for best size,
    finding the best exepacker, etc. - and we're not really interested in that.

    On the other hand, there's a practically useful option of including pre-built dictionaries in the decoder -
    for example, brotli and paq8px have these. Such dictionaries do increase the decoder size, but they also improve compression.
    Adding decoder size to the result handicaps this, although decoder size doesn't matter for practical use - it's only supposed to be a measure against exploits.
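
    As a toy illustration of the pre-built-dictionary effect, here is a minimal sketch using zlib's preset-dictionary feature from the Python standard library as a stand-in for the built-in dictionaries of brotli/paq8px; the preset and sample text are made up.

```python
# Embedded/preset dictionaries in a nutshell: the decoder ships with some data that
# the encoder may reference, which helps a lot on short or homogeneous inputs.
# zlib's zdict parameter stands in here for brotli's/paq8px's built-in dictionaries.
import zlib

preset = b"the quick brown fox jumps over the lazy dog in the compression contest dataset"
sample = b"the lazy dog and the quick brown fox both enter the compression contest"

plain = zlib.compress(sample, 9)

co = zlib.compressobj(level=9, zdict=preset)
with_dict = co.compress(sample) + co.flush()

do = zlib.decompressobj(zdict=preset)   # the decoder must carry the same preset data
assert do.decompress(with_dict) == sample

print(len(plain), len(with_dict))       # the preset-dictionary output is typically smaller
```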

  #7 schnaader (Programmer)

    I voted for "private" because:
    - We've already had so many public contests/datasets/benchmarks that something different is welcome
    - Interesting things can be done using dictionaries (e.g. MFilter deleting well-known ICC profiles from JPEGs)
    - NN and ML might take the lead, but 1) it's interesting to see how they perform, and 2) I think there will also be very interesting hand-crafted entries

  #8 Member (Prague, CZ)

    I'd say a private dataset sounds better, because then there's a higher chance the resulting compressor/algorithm will have some more general use, not just something designed for one file, which would otherwise be pretty useless. So I'd say it supports useful development more. Maybe the decompressor size could be limited even further; 16 MB is really a lot - well, that depends on whether dictionaries and other helper data are just an accepted side effect of option 2, or whether the intent is to encourage people to use them in development.

  #9 Shelwien (Administrator)

    1) This contest is intended for more practical algorithms - the lowest allowed speed would be something like 250 kb/s.
    So most likely no PAQs or NN/ML.
    2) Archiver size can be pretty large - precomp+zstd could easily be 3 MB, more if there are several compression methods.
    3) Processing speed would be part of the ranking, so dictionary preprocessing won't be a free way to decrease compressed size.
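
    A rough sketch of how such a speed floor might be enforced on the test machine, assuming it means input bytes per second of wall-clock time; the command line below is a placeholder, not a real contest tool.

```python
# Hedged sketch: timing a run and checking it against an assumed throughput floor.
import os
import subprocess
import time

MIN_SPEED = 250 * 1024  # assumed floor, interpreted here as ~250 KB of input per second

def meets_speed_floor(cmd: list[str], input_path: str) -> bool:
    """Run the (placeholder) compressor command and compare throughput to the floor."""
    size = os.path.getsize(input_path)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    return size / elapsed >= MIN_SPEED

# Hypothetical usage:
# ok = meets_speed_floor(["./entry_compress", "input.txt", "output.cmp"], "input.txt")
```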


  #10 schnaader (Programmer)

    Another question that comes to my mind regarding a private dataset: will there be automation involved to get results quickly? Because with a private dataset I can imagine literally "shooting" hundreds of compressors a day with different dictionaries to analyze the data. Would this be a valid and working strategy?

    Alex's quote "organizers will provide some samples" points in the direction of reducing this a bit, since you could also tune offline, but it would still be useful.

  #11 Shelwien (Administrator)

    > "shooting" hundreds of compressors a day

    That won't really work with gigabyte-sized datasets.
    At slowest allowed speeds it would take more than a hour to compress it.

    Number of attempts would be limited simply because of limited computing power (like 5 or so).

  #12 Member (Denmark)

    I can't vote, but I would vote private/secret.

    A public dataset encourages overtuning, which is not really helpful or a demonstration of general compression performance; in the real world the compressor does not know the data ahead of compression time.

    I would still add the decompressor size to the result, though.

  #13 Shelwien (Administrator)

    Adding decompressor size requires absurdly large data sizes to avoid exploits (for a 1 GB dataset, the compressed zstd binary is still ~0.1% of the total result).
    Otherwise the contest can turn into a decoder size optimization contest, if the intermediate 1st-place entry is open-source.

    Also, Alex is pushing for a mixed dataset (part public, part private, with uncertain shares),
    but I think that it just combines the negatives of both options
    (overtuning is still possible on the public part, decoder size still has to be counted to avoid exploits, and the compressed size of the secret part is still not 100% predictable in advance).
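
    Back-of-the-envelope numbers for that share, using the zstd.exe.7z size quoted earlier and an assumed 4:1 compression ratio for text (the ratio is an assumption, not a measurement):

```python
# How much of the final result a ~383 KB decoder represents at various dataset sizes.
# DECODER is the zstd.exe.7z figure quoted earlier in the thread; the 4:1 ratio is assumed.
DECODER = 383_490
RATIO = 0.25

for original in (100_000_000, 1_000_000_000, 10_000_000_000):   # 100 MB, 1 GB, 10 GB
    compressed = original * RATIO
    share = DECODER / (compressed + DECODER)
    print(f"{original / 1e9:5.1f} GB dataset -> decoder is {share:.3%} of the result")
```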

  #14 Member

    Well, I can't vote, but I would go with the private dataset option.

    A few of the reasons why I prefer Option 2 over Option 1:
    1. The resulting compressor/algorithm has more general practical use than a compressor optimized for a specific file/dataset, which is pretty useless most of the time.
    2. Allowing the use of dictionaries is also a great addition to the contest.
    3. I have no problem (and I suppose most people won't) whether an algorithm/compressor uses 10 methods (precomp+srep+lzma, etc.) or just modifies one method (like lzma), as long as it gets results that make it a better option in practice on multiple datasets.

    Quote Originally Posted by Shelwien:
    1) This contest is intended for more practical algorithms - the lowest allowed speed would be something like 250 kb/s.
    So most likely no PAQs or NN/ML.
    2) Archiver size can be pretty large - precomp+zstd could easily be 3 MB, more if there are several compression methods.
    3) Processing speed would be part of the ranking, so dictionary preprocessing won't be a free way to decrease compressed size.
    I totally agree with these three points as well, and it would be great to have a contest like this. Personally, I wouldn't even care about a 16 MB decompressor if it really saves more space than any other compressor: if I compress a 50 GB dataset to something like 10 GB while other compressors are around 12 GB, then 16 MB is a negligible size to account for. But anyway, it's a competition, so we take account of everything - fine by me.

  #15 Member (Ro)

    I would vote for private.

    I would also like to see some other limitations of the contest: I read that there would be a speed limit, but what about a RAM limit? There are fast NN compressors, like MCM or LPAQ; I mean they could be a starting point for some experimental fast compressors. It's hard to fight LZ algorithms like RAZOR so I wouldn't try going in that direction. Are AVX and other instruction sets allowed?

    What would be nice is some default preprocessing. If it's an English benchmark, why shouldn't .drt preprocessing (like the one from cmix) be available by choice (or .wrt + english.dic like the one from paq8pxd)? It would save some time for the developers not to incorporate them into their compressors, if there were a time limit for the contest.

  #16 Shelwien (Administrator)

    > I would also like to see some other limitations of the contest:

    > I read that there would be a speed limit, but what about a RAM limit?

    I guess there would be a natural one - the test machine obviously won't have infinite memory.

    > There are fast NN compressors, like MCM, or LPAQ.

    Yes, these would be acceptable, just not full PAQ or CMIX.

    > It's hard to fight LZ algorithms like RAZOR so I wouldn't try going in that direction.

    Well, RZ is a ROLZ/LZ77/Delta hybrid.
    It's still easy enough to achieve better compression via CM/PPM/BWT (and better encoding speed too).
    Or much faster decoding with worse compression.

    > Are AVX and other instruction sets allowed?

    Yes, but likely not AVX-512, since it's hard to find a test machine for it.

    > What would be nice is some default preprocessing.
    > If it's an English benchmark, why shouldn't .drt preprocessing (like the one from cmix)
    > be available by choice (or .wrt + english.dic like the one from paq8pxd)?

    I proposed that, but this approach has a recompression exploit -
    somebody could undo our preprocessing, then apply something better.

    So we'd try to explain that preprocessing is expected and post links to some open-source
    WRT implementations, but the data won't be preprocessed by default.

    > It would save some time for the developers not to incorporate them into their compressors,
    > if there were a time limit for the contest.

    It should run for a few months, so there should be enough time.

    There are plenty of ways to make a better preprocessor; WRT is not the only option
    (e.g. NNCP's preprocessor outputs a 16-bit alphabet),
    so it's not a good idea to block that and/or force somebody to work on WRT reverse-engineering.
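
    For context, a word-replacement transform in its most stripped-down form looks something like the sketch below. The word list and the single-byte code scheme are invented for illustration; real WRT/DRT implementations (escaping, capitalization handling, large dictionaries) are far more elaborate.

```python
# Toy WRT-like preprocessor: frequent words become single high-byte codes, which a
# general-purpose compressor can then model more cheaply. No escaping is handled,
# so this only works on plain ASCII input; real WRT implementations cover that.
WORDS = ["the", "and", "that", "have", "for", "not", "with", "this"]
ENCODE = {w: bytes([0x80 + i]) for i, w in enumerate(WORDS)}
DECODE = {code: word for word, code in ENCODE.items()}

def wrt_encode(text: str) -> bytes:
    return b" ".join(ENCODE.get(tok, tok.encode("latin-1")) for tok in text.split(" "))

def wrt_decode(data: bytes) -> str:
    return " ".join(DECODE.get(tok, tok.decode("latin-1")) for tok in data.split(b" "))

sample = "the cat and the dog share this house"
assert wrt_decode(wrt_encode(sample)) == sample
print(wrt_encode(sample))   # shorter and more regular than the original text
```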

