
Thread: Potential compression contest

#1: Shelwien (Administrator)

Potential compression contest

There's a sponsor interested in holding a contest similar to the Hutter Prize (long-term etc.), but with a more practical focus.
(ppmonstr, oodle and zstd are relevant; most paq* versions and lz4 are not.)

Please suggest compression-related tasks such that:
1) their solutions would be useful in practice;
2) the solutions won't require months of work to develop;
3) there isn't an already existing unbeatable solution that would leave no room for competition.


#2: Bulat Ziganshin (Programmer)
    compressing games LOL... and the contest may be called Compression Games LOL


#3: Shelwien (Administrator)
    Something like lossless DDS compression may be relevant actually.
    But it would be hard to find legal samples.
    ...And then profrager would immediately win it, since he already has the tools.

    I was thinking more about something like filesystem compression (with random access to blocks, but global dedup/dictionary).


#4: Bulat Ziganshin (Programmer)
Of course I'm joking. A disk image is great, but how can you build it to include various data types? And in any case it would provoke studying of this particular image, without much benefit for the real world.

Maybe games can still be downloaded (via Steam or so) without breaking the law? We aren't going to play them, after all.

#5: Member (Germany)
    Quote Originally Posted by Bulat Ziganshin View Post
Of course I'm joking. A disk image is great, but how can you build it to include various data types? And in any case it would provoke studying of this particular image, without much benefit for the real world.

Maybe games can still be downloaded (via Steam or so) without breaking the law? We aren't going to play them, after all.
A pretty big part of games is assets. There are plenty of asset packs readily available on the net, often free, and already with textures as .DDS, 3D models as COLLADA, and sounds as OGG/MP3.

Sometimes game developers are accused of asset theft because two games have the same textures, when in fact they bought the same asset pack.


#6: Member (Greece)
Maybe compression of a Linux distribution image, with a requirement of 1 MB/s compression speed and 500 MB/s decompression speed.

#7: Shelwien (Administrator)
> A disk image is great, but how can you build it to include various data types?

    I suppose we could use some linux VM image, like https://www.offensive-security.com/k...mage-download/

But with codecs in the form of a .dll/.so plugin for my benchmark program, rather than standalone executables.

    Although I'm more interested in general-purpose algorithm improvements (eg. zstd speed with lzma-like compression),
    rather than preprocessing.

Well, the small-block/random-access requirement adds some interesting changes to preprocessing though...
most recompression won't work without access to the whole structure (although jojpeg would work for the first blocks of jpeg files);
some preprocessors would work (delta, mm), some won't (BCJ/BCJ2).

So there are two new topics for preprocessing:
1) Solid encoding, blockwise random-access decoding.
All the usual algorithms can still be used, but the output format has to change to make blocks individually decodable.
Ideally there'd be a single common dictionary rather than inter-block references (see the sketch below).
2) Random block encoding.
In theory it may be possible to cache and reorder blocks to enable parsing.
Otherwise it's still possible to build a global dictionary, so compression can still be better
than independent compression of individual blocks.
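
To make the "single common dictionary" idea concrete, here's a rough sketch using zstd's existing dictionary API; the block size, compression level and buffer handling are placeholders, not a proposed spec:

Code:
/* Rough sketch: one shared dictionary, independently decodable blocks.
 * Train a global dictionary from the input blocks, then compress every block
 * against it, so any single block can later be decoded with only the
 * dictionary (via ZSTD_decompress_usingDict).  Link with -lzstd. */
#include <zstd.h>
#include <zdict.h>
#include <stdlib.h>

#define BLOCK_SIZE (64 * 1024)

size_t compress_blocks(const unsigned char *data, size_t n_blocks,
                       unsigned char **out, size_t *out_size,
                       unsigned char *dict, size_t dict_capacity)
{
    size_t *sample_sizes = malloc(n_blocks * sizeof(size_t));
    for (size_t i = 0; i < n_blocks; i++) sample_sizes[i] = BLOCK_SIZE;

    size_t dict_size = ZDICT_trainFromBuffer(dict, dict_capacity,
                                             data, sample_sizes, (unsigned)n_blocks);
    free(sample_sizes);
    if (ZDICT_isError(dict_size)) return 0;

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    for (size_t i = 0; i < n_blocks; i++) {
        size_t bound = ZSTD_compressBound(BLOCK_SIZE);
        out[i] = malloc(bound);
        out_size[i] = ZSTD_compress_usingDict(cctx, out[i], bound,
                                              data + i * BLOCK_SIZE, BLOCK_SIZE,
                                              dict, dict_size, 19);
    }
    ZSTD_freeCCtx(cctx);
    return dict_size;   /* the dictionary is stored once for the whole set */
}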

> And in any case it would provoke studying of this particular image, without much benefit for the real world.

I was thinking of performing the actual entry test on non-public data, to avoid overtuning.

> Maybe games can still be downloaded (via Steam or so) without breaking the law? We aren't going to play them, after all.

It may be possible with e.g. Warframe (although its resources are packed with oodle, so we'd need to write a dumper too),
or we could find some open-source game.
But game formats are not too interesting for me, since they only appear in games and not in storage systems.

Anyway, I'd prefer something general-purpose, like a vector-rANS port of ppmd, or lzma with explicit fuzzy matches.


#8: Member (Cambridge, UK)
If you want to compare general-purpose code, and not simply who has the best set of recognition scripts and filters, then perhaps that should be part of the submission requirements: no explicit recognition of data types with repacking, just basic modelling. Are the tools expected to be open source? (In which case people can validate that if they wish.) Perhaps the test could also add a fixed value to (rotate) every byte, or XOR the low bits with a fixed value. It may slightly harm some multi-byte delta techniques, but it'd utterly cripple custom file-format detection.

Edit: e.g.

    Code:
    $ cat /usr/share/dict/words | zstd -19|wc -c
    216749
    $ cat /usr/share/dict/words | perl -e '$/=undef;print map{chr(ord($_)^1)}split("",<>)'| zstd -19|wc -c
    216736
    $ cat /usr/share/dict/words | perl -e '$/=undef;print map{chr((ord($_)+1)&255)}split("",<>)'| zstd -19|wc -c
    216752
    $ cat /usr/share/dict/words | xz -9|wc -c
    196680
    $ cat /usr/share/dict/words | perl -e '$/=undef;print map{chr(ord($_)^1)}split("",<>)'| xz -9 |wc -c
    196632
    $ cat /usr/share/dict/words | perl -e '$/=undef;print map{chr((ord($_)+1)&255)}split("",<>)'|xz -9|wc -c
    197200
E.g. I tried paq8px183 on a local jpg file before and after XOR 1, and the output changed from 63629 bytes (very fast) to 83961 bytes (very slow), indicating the transform had broken the file-type detection model. xz gave identical sizes on both files (85512), the input being already-compressed data (86815 bytes).

    Basic transforms, with minimal impact on text compression.
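
For reference, the same transform as a trivial stdin-to-stdout filter in C (the constant 1 matches the perl one-liners above; any fixed value would do):

Code:
/* Pre-transform sketch: add a fixed value to (rotate) every byte,
 * or XOR the low bit.  Reversed by subtracting / XORing again. */
#include <stdio.h>

int main(void) {
    int c;
    while ((c = getchar()) != EOF)
        putchar((c + 1) & 255);    /* or: putchar(c ^ 1); */
    return 0;
}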

    I like the idea of a public training set and a private test set to ensure real world applicability and to reduce over-fitting.

    The (specialist) SequenceSqueeze contest did this, with a framework for participants to submit AWS images (a bit complex sadly) which the organiser then ran with their own attached private S3 bucket containing the test data.


#9: Shelwien (Administrator)
> If you want to compare general-purpose code, and not simply who has the best
> set of recognition scripts and filters, then perhaps that should be part of
> the submission requirements.

Unfortunately it's not that simple a choice.
In practical applications the preprocessing would be very important.
Also, I doubt that I'd have a budget that could motivate
the development of something like rz or oodle.
So most tasks where we could actually expect competition and
practically useful improvements would in fact be preprocessing.

    > Are the tools expected to be open source?

I think not - sources would be harder to deal with (compiling them correctly to satisfy the author, etc.),
though I'd accept sources too.
But the main format would be a dynamically-loaded binary library (.dll/.so) for my benchmark tool.

> Perhaps the test could also add a fixed value to (rotate) every byte,
> or XOR the low bits with a fixed value.

I can use this for some part of the benchmark.
It could be useful for basic match-coding evaluation.
I don't expect people to easily beat ppmd or zstd in their categories, though.

Also, this "alphabet reordering" idea is not very secure.
For example, for jpegs it would be possible to detect and recover
the original alphabet based on the Huffman/quantization tables.

    > The (specialist) SequenceSqueeze contest did this, with a framework for
    > participants to submit AWS images (a bit complex sadly) which the organiser
    > then ran with their own attached private S3 bucket containing the test data.

I'd try not to overcomplicate it :)
There'd be some trouble with the windows/linux split though -
I'm thinking of implementing the same baseline solution for both
and comparing against it.

#10: Member (USA)
    Here is a task suggestion: compressing public domain books from Project Gutenberg.

#11: Shelwien (Administrator)
That may be ok, although I've been thinking of trying something more relevant - saved pages from the web (html/css/js etc.) or strings extracted from exes.
(I recently analyzed a 12G windows VM image and 60% of it was exe/dll/sys, although half of them are duplicates;
it also contains all the usual recompression targets, but the recompression gains would be too small and slow compared to exe optimization.)

But the main question is how to set up the benchmark metric and payouts so that e.g. something like ppmd could compete.

#12: Member (Kraków, Poland)
    Quote Originally Posted by Shelwien View Post
There's a sponsor interested in holding a contest similar to the Hutter Prize (long-term etc.), but with a more practical focus.
(ppmonstr, oodle and zstd are relevant; most paq* versions and lz4 are not.)

Please suggest compression-related tasks such that:
1) their solutions would be useful in practice;
2) the solutions won't require months of work to develop;
3) there isn't an already existing unbeatable solution that would leave no room for competition.
Maybe change the objective, but still stay with enwik8? The Hutter Prize limits memory and time and rewards the highest compression ratio. Maybe instead we should seek the fastest compressors that operate within some memory limit and compress enwik8 to e.g. under 18,000,000 bytes.

#13: Member (The Netherlands)
I'm not sure what to think. If things should be practical, we'll just get another dictionary coder, written in C(++) and fine-tuned for the test data.

What I would suggest is artificially generated data: a collection of files with different sizes (1 KB, 1 MB, 1 GB) and different probability distributions (easy, moderate, hard) on the bit and byte level. Let's say 18 files in total, split into a public and a private set.
People are allowed to train their algorithm on the public set. The contest creator/holder keeps the private set for the final scores.
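
As a rough sketch of what I mean by easy/moderate/hard distributions (the skew values, sizes, seed and file names are placeholders, not a proposed contest spec):

Code:
/* Generate files whose byte distribution ranges from heavily skewed ("easy")
 * to nearly uniform ("hard"), using Zipf-like weights P(i) ~ 1/(i+1)^skew.
 * Compile with -lm. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static void gen_file(const char *name, double skew, long nbytes) {
    double cdf[256], total = 0.0;
    for (int i = 0; i < 256; i++) { total += pow(i + 1, -skew); cdf[i] = total; }
    FILE *out = fopen(name, "wb");
    for (long n = 0; n < nbytes; n++) {
        double r = (double)rand() / RAND_MAX * total;
        int b = 0;
        while (b < 255 && cdf[b] < r) b++;
        fputc(b, out);
    }
    fclose(out);
}

int main(void) {
    srand(12345);
    gen_file("easy.bin",     2.0, 1L << 20);   /* 1 MB each; vary sizes too */
    gen_file("moderate.bin", 1.0, 1L << 20);
    gen_file("hard.bin",     0.1, 1L << 20);
    return 0;
}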

We could use a widely accepted tool like 7zip on default settings to set a ceiling size for every file. Competitors have to write a program that compresses equal to or better than default 7zip, and they are judged by the sum of encoding and decoding time. I would also suggest a modern memory constraint like 8 GB or so.

But I would love to see a second competition where size alone is the objective. From the past we learned that time and memory constraints diminish over the years, and that innovative but impractical algorithms from the '80s and '90s are just fine and mainstream now. Why not a memory constraint of 16 GB and a time constraint of 24 hours for this competition, and let's see what the competitors create?

#14: Shelwien (Administrator)
    > Maybe change the objective, but still stay with enwik8?

Enwik8 compression is basically only useful for enwik8 compression.
The enwik file contains too much unique stuff, like wiki markup, large tables in mixed html/wiki markup,
bibliographies, reference lists, language lists.
So some preprocessing tricks can significantly change the results even if we ran a blockwise random-access benchmark on enwik.
Which could be ok normally, but enwik tricks are not relevant for anything else.

> The Hutter Prize limits memory and time and rewards the highest compression ratio.
> Maybe instead we should seek the fastest compressors that operate within some memory limit and compress enwik8 to e.g. under 18,000,000 bytes.

Under _19,000,000_ there are basically only coders with WRT preprocessing and slow CM/PPM+SSE.

I guess it's possible to set up rules such that durilca would win, but none of the paq derivatives;
that would still keep most of the drawbacks of the HP contest though (like having no applications, encouraging information gathering rather than programming, etc.).

I'm thinking about focusing on compression of small blocks (4k..1M) for random access, since:

1) It's an important (basically the only) use-case in practical applications - filesystems, databases, storage, even games.
But the "compression scene" has very limited resources for it: 90% of LTCB entries are unfit,
while a number of the remaining ones require refactoring and new features (e.g. external dictionary support)
to be competitive.

2) The benchmark order changes significantly in this case - many popular choices become useless
(e.g. BSC has worse compression and is slower than ppmd), so it should be interesting.
Well, in practical applications zstd would currently win (compression ratio + decoding speed + external dictionary support),
but better ratios and faster encoding are possible, so there's stuff to work on.

3) It's possible to make progress by tweaking existing codecs or preprocessors,
which should let people participate in the contest without having to invest months of work.


#15: Member (The Netherlands)
    Quote Originally Posted by Shelwien View Post
    I'm thinking about focusing on compression of small blocks (4k..1M) for random access
I love the idea. Maybe even smaller blocks, like 32 bytes, or variable length? Just as in a real database? This way you can't just use zstd to win.
You could use one or more open-source databases as input.

#16: Shelwien (Administrator)
> I'm not sure what to think. If things should be practical, we'll just get another dictionary coder,
> written in C(++) and fine-tuned for the test data.

    PPM/fast_CM/BWT have faster encoding and better compression than LZ.

And yes, I don't intend to support python/java/C#/JS - these developers can find their own solution
for trans-compiling to C/C++, since there are no high-quality compilers for anything else.

Also, modern LZs have much higher algorithmic complexity than anything else,
and it's still possible to improve their parsing optimization and entropy coding.

    It may be also possible to speed-optimize paq-like CMs.

    > What I would suggest is artificially generated data.

That was attempted before: http://mattmahoney.net/dc/uiq/
And it failed quickly enough, after a method of preprocessing was discovered.

I'd rather have preprocessors for relevant data (e.g. exes) than for something artificially generated for the contest.

> The contest creator/holder keeps the private set for the final scores.

    I do intend to keep the test corpus private.

> But I would love to see a second competition where size alone is the objective.

    I have some hope that Hutter Prize would be rebooted for that.

> Why not a memory constraint of 16 GB and a time constraint of 24 hours for this competition, and let's see what the competitors create?

I agree on 16 GB, but a smaller time limit imho helps competitors.
Otherwise it could easily become a competition in having access to more hardware to run some cmix/nncp hybrid.

I'm actually more interested in somehow removing the decoder size limit,
to enable natural language parsing etc.
But that would rely too much on having unique test data which nobody can guess.

#17: Member (The Netherlands)
    Quote Originally Posted by Shelwien View Post
> What I would suggest is artificially generated data.

That was attempted before: http://mattmahoney.net/dc/uiq/
And it failed quickly enough, after a method of preprocessing was discovered.
I think that's a Baconian fallacy? Artificially generated data isn't bad just because Matt did it once and someone found a way to preprocess it for better compression. I think it's still the way to go for an objective competition.

    Quote Originally Posted by Shelwien View Post
I'd rather have preprocessors for relevant data (e.g. exes) than for something artificially generated for the contest.
Would hate this. I love the idea behind paq8, but all the custom man-made models for all sorts of cases (at least for me) go against the idea of general artificial intelligence.
If the competition becomes a game of hand-crafted preprocessing and a bunch of custom man-made models, I would lose interest.


    Quote Originally Posted by Shelwien View Post
I'm actually more interested in somehow removing the decoder size limit,
to enable natural language parsing etc.
But that would rely too much on having unique test data which nobody can guess.
This would quickly lead to abuse of the rule, I'm afraid. It would be more interesting to allow the compressor to build/develop/find (unsupervised training) a custom model that acts like a preprocessor or a custom model for the data, and to treat this as a separate step besides encoding/decoding.

#18: Shelwien (Administrator)
> Artificially generated data isn't bad just because Matt did it once
> and someone found a way to preprocess it for better compression.

Artificial data is bad because it's artificial.
It would work if we could reproduce the properties of the data we're specifically targeting (e.g. BWT output),
but it would prevent discovering new patterns (which is a good thing when the data is fit for practical applications).

When writing the data generator it's easy to miss something which can then be exploited in the contest
(e.g. there are PRNGs in actual compiler standard libraries which produce "random numbers" compressible by paq).

> I think it's still the way to go for an objective competition.

I also considered it, but I don't see how to make it useful and unexploitable.

>> I'd rather have preprocessors for relevant data (e.g. exes) than for something artificially generated for the contest.
> Would hate this. I love the idea behind paq8, but all the custom man-made models for all sorts of cases

It's a good way to create actual competition, with multiple people making improvements.

paq8 is too complicated and "solid" (not modular), so imho it has effectively stopped CM development since its release.
paq8 lacks many things (explicit parameters for component tuning, precision, support for enc/dec asymmetry),
but a new CM with new features can't compete with paq8 in benchmarks, because first it needs to have
all the same submodels (match, sparse, periodic, 2D, indirect), and there's no analysis of submodel
contribution which would explain to a new developer what has to be integrated to compete with paq8, and why.

Btw, I only listed the "pure" universal models there, which make sense just based on theory.

    > (at least for me) go against the idea of general artificial intelligence.

    This contest certainly won't be about AI. The (potential) sponsor is storage-related.

Also, I much prefer standalone preprocessors to a combination of complex features which happen to combine
to exploit some pattern in the data. For example, consider rep-matches in lzma (and clones) - they were (likely)
introduced simply because somebody noticed that LZ distances are sometimes repeated.
But then lzma's optimal parser has heuristics for checking multi-token patterns like match-literal-rep_match,
which effectively let it choose longer "fuzzy matches" over simple nearest repeated substrings.
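
Not lzma's actual parser code - just a toy illustration of the kind of price comparison meant here; the price fields and the per-byte heuristic are made up for the example:

Code:
/* Toy sketch (NOT lzma's real optimal parser): compare the estimated price
 * per covered byte of a plain match against a "fuzzy match" encoded as
 * match + one literal + rep-match reusing the same distance. */
#include <stdint.h>

typedef struct {
    uint32_t match_price;      /* match token with a new distance, in bits */
    uint32_t rep_match_price;  /* match token reusing the last distance */
    uint32_t literal_price;    /* one literal */
} Prices;

/* exact_len = length of the plain match;
 * fuzzy_len = total length covered by match + literal + rep-match.
 * Returns 1 if the fuzzy match is cheaper per covered byte. */
int prefer_fuzzy_match(const Prices *p, uint32_t exact_len, uint32_t fuzzy_len)
{
    uint64_t fuzzy_price = (uint64_t)p->match_price + p->literal_price + p->rep_match_price;
    /* fuzzy_price/fuzzy_len < match_price/exact_len, without division */
    return fuzzy_price * exact_len < (uint64_t)p->match_price * fuzzy_len;
}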

> If the competition becomes a game of hand-crafted preprocessing and a bunch of custom man-made models, I would lose interest.

We'll see, I guess.
I still think that making a good exe filter for blockwise zstd (for example) would be more interesting than tweaking paq8.
If that's not for you, Hutter's contest is still active.

> It would be more interesting to allow the compressor to build/develop/find (unsupervised training)
> a custom model that acts like a preprocessor or a custom model for the data, and to treat this as a separate step besides encoding/decoding.

In theory I agree, and that was supposedly Hutter's (and Mahoney's) original intention for the contest.

But in practice I don't believe in fully learning English from enwik, even if we got humans to work on it.
Some aspects are discoverable (like words/spaces/punctuation, or parts of morphology; maybe synonyms),
but actual understanding is only possible with external information (e.g. pictures of objects known from external information),
and some parts are impossible without supervision (rhymes, rhythm and other things related to phonetics; detailed morphology; idioms)
or at least specially prepared descriptions fed to the AI.

Of course, some kinds of data analysis are still possible - for example, atm paq8 doesn't have parsing optimization.
But I can't add it to paq (it requires backtracking etc.), and if I made a new CM with it, it would have worse compression than paq,
and thus there wouldn't be any interest.

    Btw, I'd like your feedback on this project: https://encode.su/threads/3072-contr...p-tools-for-HP

#19: Member (boot ROM)
    Quote Originally Posted by Shelwien View Post
    I was thinking more about something like filesystem compression (with random access to blocks, but global dedup/dictionary).
The filesystems I've seen were using a rather specific approach to deduplication. Filesystems are tricky: they face non-trivial performance challenges, and these are different from the tradeoffs encountered in general-purpose compression.
1) Filesystems deal with block devices - devices unable to address anything smaller than a "sector". This means filesystems have to live in an aligned world; otherwise they have to do read-modify-write, which implies awful performance. So merely adding or removing a dictionary entry isn't simple - you have to consider whether it can happen in a "heavily aligned" manner.
2) It's common to optimize by aligning everything to the block size or a multiple of it (typically 4K, which also matches CPU pages on most archs, so it's good for performance). Unfortunately, incomplete use of blocks means wasted space, so there's a tradeoff. Being both compact and fast... is a bit like the holy grail of filesystem design.
3) Filesystems should cope with access in a truly random manner. This prompts them to use independent compression on fairly small blocks. Furthermore, there are often "wasteful" optimizations: say, if we managed to compress 2 blocks into 1 block, go for it; but if 2 blocks compressed into 1.5 blocks, it's a no-win (see the sketch after this list), since you still have to shuffle 2 blocks, and attempting to use the remaining space in the half-empty block brings a lot of woes: there has to be extra metadata to describe it, it adds a ton of overhead, and you may have to do read-modify-write far more often than desirable. Filesystems are also often supposed to have at least some plan for what happens if e.g. power is suddenly lost during all this; RMW is dangerous in this regard, and attempts to do something about it hurt performance even more.
4) When it comes to actual dedup, it's also challenging in terms of performance. Say you reference the same data 20 times - now imagine it changes. At some point you may have to update the metadata in 20 places. Deleting a dictionary entry is anything but straightforward, isn't it?
5) Things are even more challenging now that there are very fast (e.g. NVMe) SSD devices - and it's clearly heading in this direction.
6) All this implies filesystems are rather complicated and delicate things, so getting an advanced design on wheels takes years, and getting rid of the bugs takes even more.
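
The "2 blocks into 1 block is a win, 2 blocks into 1.5 blocks is not" rule from point 3 boils down to something like this (4K sector size is an assumption):

Code:
/* Sketch of the sector-granularity rule: store the compressed form only if
 * it frees at least one whole sector; otherwise keep the data uncompressed
 * to avoid partial-sector bookkeeping and read-modify-write. */
#include <stddef.h>

#define SECTOR_SIZE 4096u

static size_t sectors(size_t bytes) {
    return (bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;   /* round up */
}

/* Returns 1 if the compressed form is worth storing. */
int worth_storing_compressed(size_t raw_size, size_t compressed_size) {
    return sectors(compressed_size) < sectors(raw_size);
}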

The designs I've seen were doing dedup on a block basis: if certain block(s) match exactly, ok, they can be replaced by references, and this can happen in a manner acceptable from a performance point of view. But it wouldn't catch the same file with 10 bytes inserted at the start, so it's a rather crude and special kind of "dictionary". And even then, dealing with heavily referenced sets of blocks can get slow. I think I've seen some backup programs trying to do more or less what was described, but backups don't have to deal with random read-write accesses where the performance of operations is at a premium.

So filesystem compression is a bit specific and has to perform reasonably on independent blocks, fairly limited in total size for the sake of random access. It should also perform reasonably on something like 4K or a few multiples of that - this implies the lack of a heavy initialization phase would be a bonus, etc. There are some exceptions, like compressed ROM filesystems, which can afford somewhat larger blocks (e.g. 256K in squashfs). But larger blocks bring their own tradeoffs, taking more RAM for cache and slowing down random access, and all that usually on low-power devices with limited CPU and RAM (at least ROM filesystems are relatively uncommon on x86, except maybe "initial ramdisks").

So in a perfect world, filesystem compression likely:
1) Can compress reasonably even a rather small, independent set of data, like a few KB or so.
2) Lacks expensive initialization at the start of compression/decompression (to deal with numerous independent blocks reasonably).
3) Is blazing fast to decompress.
4) Ideally, the same for compression as well, without nasty pathological cases. It's a perfectly valid scenario for a user to write megabytes of zeros, incompressible data or whatever - getting stuck or seriously slowed down is quite a problem.

#20: Shelwien (Administrator)
There are plenty of read-only or mostly-RO cases, like VMs, where we could optimize the image first...
but there are no formats with advanced compression (like CDC dedup, global dictionary, modern entropy coding, preprocessing).

It's easy enough to implement a custom FS: https://github.com/dokan-dev/dokany https://github.com/codedhead/archivefs
But there's no low-delay container format which could combine state-of-the-art compression with random access.
There's nothing impossible in it, or even especially hard - developers are just more interested in compressing enwik10 with 32GB of RAM.
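
Purely as an illustration (this is not an existing format), such a container could be little more than a header, one shared dictionary and a block index:

Code:
/* Hypothetical on-disk layout for a random-access compressed container
 * (illustration only): fixed header, one shared dictionary, a block index,
 * then independently decodable compressed blocks. */
#include <stdint.h>

#define CFS_MAGIC 0x43465331u       /* "CFS1", placeholder */

struct cfs_header {
    uint32_t magic;
    uint32_t block_size;            /* uncompressed block size, e.g. 64 KB */
    uint64_t block_count;
    uint64_t dict_offset;           /* shared dictionary used by all blocks */
    uint64_t dict_size;
    uint64_t index_offset;          /* array of cfs_block_entry[block_count] */
};

struct cfs_block_entry {
    uint64_t offset;                /* file offset of the compressed block */
    uint32_t compressed_size;
    uint32_t codec_id;              /* per-block codec / preprocessor choice */
};

/* Reading block i: load the header and dictionary once, then one seek plus
 * one decompress call per block - which is the "low delay" property. */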

#21: Member (Mars)
The contest goal could be outperforming, for example, 7z LZMA2 compression on multiple data sets, in terms of same or lower time and same or better compression.

#22: Shelwien (Administrator)
Sure, but it's pretty hard to set up a fair competition with just that.
LZMA has old-style entropy coding (from the time when out-of-order cpus didn't exist),
but there are several lzma-based codecs developed for new cpus - LZNA etc.
These new codecs don't really beat lzma in compression, but they do have 5-8x faster decoding.
Also there's RZ and various MT schemes (including PPM/CM) which would consistently beat 7z LZMA2.

The problem is, I'd like this to be an actual competition and not just a search for (and award to) the best non-public preprocessor developer.

#23: Member (Cambridge, UK)
I guess if you could find funding for a virtual machine instance (e.g. AWS or Google Cloud support), then one way of "winning" the competition is being on the Pareto frontier - for either encode or decode. It encourages research into all aspects of data compression rather than simply chasing the best ratio. It can be on hidden data sets too (the SequenceSqueeze competition did that), so it's slow and hard to over-train to one specific set.

    The reason for a standard virtual machine is automation and reproducibility. Either that or one person has to run every tool themselves for benchmarking purposes.
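
The Pareto criterion is also easy to evaluate mechanically; a quick sketch (the struct fields are illustrative):

Code:
/* Sketch of the Pareto-frontier criterion: an entry "wins" if no other entry
 * is at least as good in both compressed size and time, and strictly better
 * in at least one of them. */
#include <stddef.h>

struct entry { double size; double time; };   /* e.g. bytes, seconds */

static int dominates(const struct entry *a, const struct entry *b) {
    return a->size <= b->size && a->time <= b->time &&
           (a->size < b->size || a->time < b->time);
}

/* Sets on_frontier[i] to 1 if entry i is Pareto-optimal. */
void pareto_frontier(const struct entry *e, size_t n, int *on_frontier) {
    for (size_t i = 0; i < n; i++) {
        on_frontier[i] = 1;
        for (size_t j = 0; j < n; j++)
            if (j != i && dominates(&e[j], &e[i])) { on_frontier[i] = 0; break; }
    }
}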

Edit: I'd also say some subtle data permutations of the input could scupper the race for the best preprocessors, especially if the permutations are secret, while still keeping the same general data patterns. E.g. something as dumb as rotating all byte values by 17 would completely foul many custom format-specific preprocessors while not breaking general-purpose things like automatic dictionary generation or data segmentation analysis.
