
Thread: GDC Competition: Discussions

  1. #1
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts

    GDC Competition: Discussions

    Hi,
    This thread is for discussions related to the Global Data Compression Competition.
    My name is Maxim and I’m the first person to blame in case of anything.
Note that any and all formal announcements are made only on the official website or from the
Globalcompetition at compression ru email address.
Check https://globalcompetition.compression.ru/ and https://globalcompetition.compression.ru/rules/,
the latter in particular for the rules and explanations on how to submit your compressor.

  2. Thanks (11):

    Alexander Rhatushnyak (17th June 2020),algorithm (18th June 2020),compgt (30th June 2020),Hakan Abbas (18th June 2020),JamesWasil (17th June 2020),Mike (17th June 2020),Noragami (18th June 2020),schnaader (17th June 2020),Shelwien (17th June 2020),Sportman (20th June 2020),vteromero (17th June 2020)

  3. #2
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    107
    Thanks
    37
    Thanked 29 Times in 20 Posts
For Test 4 (random access on 32 KB blocks), I think the time limits are very high.

This block size is usually used for transparent compression in filesystems and for compressed swap such as zram, where speed is usually the bottleneck.

EDIT: Or maybe decrease the constant on compressed size. 10^-6 seems a bit high to me. Also, a geometric formula like c_time^α * d_time^β * size^γ seems better to me.
    Last edited by algorithm; 18th June 2020 at 00:58.

  4. #3
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by algorithm View Post
For Test 4 (random access on 32 KB blocks), I think the time limits are very high.

This block size is usually used for transparent compression in filesystems and for compressed swap such as zram, where speed is usually the bottleneck.
    I agree as far as this use case is concerned. The limits were discussed internally and the final numbers are what they are because:
    * we do want to use the same thresholds for all categories;
    * the mentioned use case is not the only one for small-block compression.

I believe the major problem we ran into while discussing and devising the rules is the lack of statistics on the demand for lossless data compression in the industry. It's problematic to assign weights to different use cases depending on their frequency of use, benefits for the world economy, carbon footprint, human happiness index or whatever. We just don't know this frequency and the rest. Say, if we used global Internet traffic statistics as the foundation, then the competition should be about dealing with DCT coefficients and the various syntax elements of a video stream. But this is just one facet of, well, a multidimensional phenomenon.

  5. #4
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by algorithm View Post
    Also a geometric formula like c_time^α * d_time^β * size^γ seems like a better formula to me.
    It may seem better, yes. But it was discussed and dismissed.
    Such formulas make sense if we try to answer questions like "how hard is it to reach this level of excellence?"
We are interested in a practical economic result, not in the volume of effort. We believe that normally the cost function is linear.
20% of the effort gave 80% of the result? The user does not care.

  6. #5
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    107
    Thanks
    37
    Thanked 29 Times in 20 Posts
    Quote Originally Posted by Ms1 View Post
    It may seem better, yes. But it was discussed and dismissed.
    Such formulas make sense if we try to answer questions like "how hard is it to reach this level of excellence?"
We are interested in a practical economic result, not in the volume of effort. We believe that normally the cost function is linear.
20% of the effort gave 80% of the result? The user does not care.
I don't understand. The formula changes the ranking order. It is the same idea as using the geometric mean when benchmarking.

    A practical example
    Code:
    codec1: c_time 100MB/s d_time 1GB/s ratio 45% -> score 10+2+450 = 462
    codec2: c_time 1GB/s d_time 10 GB/s ratio 47% -> score 1 + 0.2 + 470 = 471.2
But codec 2 is close to 10x better.

Compare to the geometric formula c_time^0.5 * d_time^0.5 * (size/10,000,000)^1.5.

α, β, γ should satisfy α + β <= γ.
Dividing by 10,000,000 doesn't change the order.

    Code:
    codec1: 10^0.5*2^0.5*45^1.5 = 1350
    codec2: 1^0.5*0.2^0.5*47^1.5 = 144
    so 9.3x better
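For readers who want to check these numbers, below is a small sketch (not part of the contest tooling) that reproduces both scores for the two hypothetical codecs, assuming a 1 GB (10^9 byte) input; note that, following the numbers above, the decompression term in the geometric score uses the weighted time 2 × d_time.

Code:
# Sketch: linear contest score vs. the proposed geometric score for two
# hypothetical codecs on a 1 GB input. All names here are illustrative.
INPUT_SIZE = 1_000_000_000  # bytes

def linear_score(c_time, d_time, c_size):
    # f = c_time + 2*d_time + 1e-6 * c_full_size (the rapid-compression formula)
    return c_time + 2 * d_time + 1e-6 * c_size

def geometric_score(c_time, d_time, c_size):
    # c_time^0.5 * (2*d_time)^0.5 * (size/10,000,000)^1.5, matching the example above
    return (c_time ** 0.5) * ((2 * d_time) ** 0.5) * ((c_size / 1e7) ** 1.5)

codecs = {
    # name: (compression speed B/s, decompression speed B/s, compression ratio)
    "codec1": (100e6, 1e9, 0.45),
    "codec2": (1e9, 10e9, 0.47),
}

for name, (c_speed, d_speed, ratio) in codecs.items():
    c_time = INPUT_SIZE / c_speed
    d_time = INPUT_SIZE / d_speed
    c_size = ratio * INPUT_SIZE
    print(name, round(linear_score(c_time, d_time, c_size), 1),
          round(geometric_score(c_time, d_time, c_size)))
# codec1: 462.0 and ~1350; codec2: 471.2 and ~144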
    Last edited by algorithm; 18th June 2020 at 12:33. Reason: typo

  7. #6
    Member
    Join Date
    Dec 2019
    Location
    Japan
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts
Good afternoon, are test data sets 3 and 4 not available yet, or am I just unable to find them?

  8. #7
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    250
    Thanks
    46
    Thanked 105 Times in 54 Posts
    Quote Originally Posted by Noragami View Post
Good afternoon, are test data sets 3 and 4 not available yet, or am I just unable to find them?

    Not available yet. We are working on this.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  9. #8
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by algorithm View Post
    A practical example
    Code:
    codec1: c_time 100MB/s d_time 1GB/s ratio 45% -> score 10+2+450 = 462
    codec2: c_time 1GB/s d_time 10 GB/s ratio 47% -> score 1 + 0.2 + 470 = 471.2
This normally should not happen. We limit the time. We effectively suggest that participants align their speed (ctime + dtime) with the thresholds. It's true, however, that different algorithms and implementations may have an edge over others because of more flexible control over the time/compression-ratio trade-off.
We might use Pareto frontiers to filter out "suboptimal" compressors, but this would leave unsolved the question of what to do with the Pareto-optimal entries. Thus we fix the execution time.
We don't really know what the "true" weights of csize, ctime and dtime are. They depend on the use case. We use the formula only to break ties.
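To illustrate the Pareto filtering idea mentioned above, here is a rough sketch (a hypothetical helper, not anything from the actual evaluation scripts) that keeps only entries not dominated on all of c_time, d_time and c_size:

Code:
# Sketch: keep only Pareto-optimal results over (c_time, d_time, c_size).
# An entry is dominated if another entry is no worse in every metric and
# strictly better in at least one. O(n^2), purely illustrative.
def pareto_frontier(results):
    # results: list of (name, c_time, d_time, c_size); smaller is better
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [name for name, *m in results
            if not any(dominates(other[1:], m) for other in results)]

print(pareto_frontier([
    ("fast-but-big",   1.0,  0.5, 600_000_000),
    ("slow-but-small", 50.0, 20.0, 400_000_000),
    ("dominated",      60.0, 25.0, 650_000_000),  # worse everywhere, dropped
]))  # -> ['fast-but-big', 'slow-but-small']

As the post notes, this only removes dominated entries; it says nothing about how to rank the remaining ones, which is why the execution time is fixed instead.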

  10. #9
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
Good afternoon, are test data sets 3 and 4 not available yet, or am I just unable to find them?

    Not available yet. We are working on this.
    Hopefully, they will be available in a week.

  11. #10
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Multithreading is prohibited: we allow one thread only.
That is really disappointing. Obviously, it is your contest and your rules, but it feels like a missed opportunity to test compressors using common features of modern CPUs (namely multiple cores).

    f = c_time + 2 × d_time + (1 / 1,000,000) × c_full_size.
The formula looks fine. In fact, it is pretty much the same as the one discussed here: https://encode.su/threads/3410-Fasci...ll=1#post65363
So why use different rules for different categories?

    For the balanced-compression and high-compression-ratio categories, we order results by compressed-data full size only
It looks like changing the mixing coefficient (say to something like 1/10,000 or 1/1,000) would achieve the same goal while keeping the same formula (see the sketch at the end of this post).
I am also surprised that there are no stricter limits on memory. IMO, the risk is ending up with submissions using unrealistic amounts of memory just to win the contest.

    Each participant guarantees that the submitted compressor or compressors are that participant’s own work or, if they’re based on third-party source code, they differ essentially from that code and the participant has received all necessary permissions to modify and submit the code to this competition. Each participant must also guarantee that the submitted compressor or compressors violate no third-party intellectual-property rights.
    It is going to be hard to enforce (binaries may contain borrowed open source code with no attribution) but it is good to see this listed explicitly.

    Just my 2 cents.
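On the mixing-coefficient point above: a quick sketch (hypothetical numbers, same formula) showing that a larger coefficient B makes the compressed-size term dominate, so the ranking effectively becomes a size-only ordering:

Code:
# Sketch: f = c_time + 2*d_time + B*c_full_size with two hypothetical codecs.
# With B = 1e-6 the time terms matter; with B = 1e-3 the size term dominates
# and the ranking is effectively by compressed size only.
def f(c_time, d_time, c_size, B):
    return c_time + 2 * d_time + B * c_size

entries = {
    "smaller-but-slower": (200.0, 50.0, 300_000_000),
    "faster-but-larger":  (20.0,   5.0, 310_000_000),
}

for B in (1e-6, 1e-3):
    scores = {name: f(*args, B) for name, args in entries.items()}
    print(B, min(scores, key=scores.get), scores)
# B=1e-6: "faster-but-larger" wins (340 vs 600)
# B=1e-3: "smaller-but-slower" wins (300300 vs 310030)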

  12. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    >> Multithreading is prohibited: we allow one thread only.

    > That is really disappointing. Obviously, your contest, your rules but it
    > feels like a missed opportunity to test compressors using common features
    > of modern CPUs (namely multi-cores).

1) It's very hard to compare different codecs with MT when speed is taken into account,
especially on different operating systems.

2) Proper MT design, with precise load balancing to ensure 100% CPU load, is a complex task on its own,
and this is a compression contest rather than an "efficient MT job manager" contest.

3) The sponsor is not interested in the usual (blockwise) kinds of MT codecs at all.
They would like it if somebody managed to multithread small-block compression,
but threading has high overhead, so cases where splitting the work within a 32 KB block
would improve compression speed are unlikely.

    > I am also surprised that there are not stricter limits on memory.

1) It's hard to set strict memory limits - memory usage changes during processing,
the sum of allocated block sizes differs from the number of actually mapped pages,
it's easy enough to implement custom swap via guard pages, etc.

2) There's a lower limit on allowed speed (4000s for 1GB = ~250KB/s)
which would block dumb CM scaling.
Otherwise, e.g. LZMA+bt4 needs 12.5GB of RAM for a 1GB window.
There's no point in limiting the memory to 1GB or some such.

  13. #12
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
1) It's very hard to compare different codecs with MT when speed is taken into account,
especially on different operating systems.
    Agreed.

2) Proper MT design, with precise load balancing to ensure 100% CPU load, is a complex task on its own,
and this is a compression contest rather than an "efficient MT job manager" contest.
I am not following the argument here. How is the complexity relevant? The goal is to utilize the CPU resources at hand. Would you say "rather than, say, an AVX/AVX2 efficiency contest"?
Efficiency is for the codec writer to worry about, not the end user. Besides, an efficient job manager does not guarantee good or fast compression.

1) It's hard to set strict memory limits - memory usage changes during processing,
the sum of allocated block sizes differs from the number of actually mapped pages,
it's easy enough to implement custom swap via guard pages, etc.
    Hard to enforce but not too hard to measure.

3) The sponsor is not interested in the usual (blockwise) kinds of MT codecs at all.
They would like it if somebody managed to multithread small-block compression,
but threading has high overhead, so cases where splitting the work within a 32 KB block
would improve compression speed are unlikely.
    Fair enough.


2) There's a lower limit on allowed speed (4000s for 1GB = ~250KB/s)
which would block dumb CM scaling.
Otherwise, e.g. LZMA+bt4 needs 12.5GB of RAM for a 1GB window.
There's no point in limiting the memory to 1GB or some such.
I did not write about a 1GB limit, but what if compressors use 20+GB of RAM to win the contest? How would that be useful?

  14. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > I am not following the argument here. How is the complexity relevant?
    > The goal is to utilize the CPU resources at hand.
    > Would you say "rather than say, AVX/AVX2 efficiency contest" ?

Actually, yes? There are hundreds of codecs on the net, but very few of
them really make use of SIMD or MT.
In fact, there's still no reliable portable way to do MT and SIMD everywhere -
some compilers don't support Intel intrinsics, some don't have pthreads.
OpenMP is bloated and has high overhead, std::thread is not supported by mingw, etc.

However, SIMD is not an auto-win feature in compression benchmarks
(and compilers can do automatic vectorization), while MT kind of is.

    > Efficiency is for the codec writer to worry about, not the end user.
    > Besides an efficient job manager does not guarantee good or fast
    > compression.

You're forgetting open-source codecs.
I mean, it would be possible to just implement efficient blockwise MT
for some open-source library (e.g. fast-lzma2) and potentially win,
without any compression-related experience at all.

In any case, based on known benchmark results, we decided that
MT would break the competition - very few people would want to write it,
and then everybody else would lose motivation.

As I said, it's supposed to be a compression competition.
The main point should be the improvement of compression algorithms,
not platform-specific speed optimization tricks.

    > Hard to enforce but not too hard to measure.

I have a utility like that for Windows.
It loads an executable in debug mode and tracks memory usage
in the handler of debug events.
It's basically what would be needed to measure memory usage the way you suggest.

The problem is that it affects processing speed, so the memory usage check
has to be done in a separate run.
But 4000s is about 1.1 hours, so testing one program with 20 compression levels
can easily take a day on its own.

In short, we discussed it, but Maxim thinks that memory checks
would add too much load for him.
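For reference, a crude cross-platform alternative to a debugger-based tracker is to poll the child process while it runs. A sketch assuming the third-party psutil package (sampling can miss short allocation spikes, and this is not what the organizers use):

Code:
# Sketch: run a compressor and report the peak resident set size observed
# by periodic sampling. Assumes the psutil package; the command line below
# ("./mycodec" etc.) is a placeholder.
import time
import psutil

def peak_rss(cmd, interval=0.05):
    proc = psutil.Popen(cmd)              # psutil wrapper around subprocess.Popen
    peak = 0
    while proc.poll() is None:            # process still running
        try:
            peak = max(peak, proc.memory_info().rss)
        except psutil.NoSuchProcess:      # exited between poll() and sampling
            break
        time.sleep(interval)
    return peak

# print(peak_rss(["./mycodec", "c", "input.bin", "output.bin"]) / 2**20, "MiB")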

    > I did not write about a 1GB limit but what if compressors
    > use 20+GB of RAM to win the contest ? How would that be useful ?

We think that it is reasonable to let codecs index the whole (1GB) file.
Depending on the data structures, that can easily require >10GB for some algorithms
(LZ or PPM mainly).
The originally suggested memory limit was 16GB, but since the test machine has 32GB,
why not use it; it doesn't really change anything.

As I already mentioned, the only algorithm where more memory directly
improves compression is CM, but there it also reduces the speed.

Otherwise, I'm not aware of any tricks to easily improve compression
by adding more memory beyond the amount necessary to index the whole file.

Also, if some algorithm is unable to index the whole file and loses in compression because of that...
well, I think that's OK, because RAM sizes will keep increasing in the future.

Btw, there is also supposed to be a blockwise compression task,
where memory size won't matter at all.

  15. #14
    Member lz77's Avatar
    Join Date
    Jan 2016
    Location
    Russia
    Posts
    127
    Thanks
    38
    Thanked 13 Times in 9 Posts
Will you publish the best f for rapid compression every month (f = c_time + A×d_time + B×c_full_size)?

  16. #15
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
There are hundreds of codecs on the net, but very few of
them really make use of SIMD or MT.

    You are kind of making my point that we need more codecs utilizing modern CPU features (outside this contest).

    The main point should be improvement of compression algorithms,
    not some platform-specific speed optimization tricks.
    MT is not platform-specific optimization "tricks".

    In fact there's still no reliable portable way to do MT and SIMD everywhere -
    some compilers don't support intel intrinsics, some don't have pthreads.
    OpenMP is bloated and has high overhead, std::thread is not supported by mingw, etc.
That has not been my experience with MT C++ code. I have seen no portability issues across compilers (gcc, clang, mingw, msvc) or OSes (Windows, Linux, macOS, Android).
You know very well that you can address these portability problems at compile time or at runtime (by checking CPU capabilities), so I do not think it is a real issue.

    In short, we discussed it, but Maxim thinks that memory checks
    would add too much load for him.
    So, this is the rationale. It is a valid concern.

  17. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > You are kind of making my point that
    > we need more codecs utilizing modern CPU features (outside this contest).

    I agree, but I think that the right way would be to formulate specific
    tasks for SIMD (vector implementations of specific functions; x86 and ARM separately),
    MT (best job manager for a specific codec framework), maybe best
    implementations of various components (matchfinders etc).

    Unfortunately the setup of each of these tasks would take considerable time
    (task spec, samples, test scripts, baseline code), and then very few
    people would be interested in solving them, and there'd be no sponsor
    to pay for the solutions.

In any case, >90% of existing codecs don't have MT support,
and we felt that it's a bad idea to exclude them.
One possible approach would be to provide our own MT framework
for codecs which don't have it, but that would complicate
the entry requirements (dll/so + API instead of a console coder).
As you can see, we still can't release even the block test API,
which is similar, but much simpler.

    > MT is not platform-specific optimization "tricks".

Unfortunately, it is - there are different cache-sharing rules, synchronization costs, etc.
Splitting the task across N threads doesn't mean N times faster processing,
and an MT algorithm working efficiently on one platform doesn't mean that it
will behave the same on another (or work at all).

    > It has not been my experience with MT C++ code. I have seen no portability
    > issues across compilers (gcc, clang, mingw, msvc) or OSes (windows, linux,
    > macOs, Android).

I'm happy for you if it's true; maybe you have an established framework which
solves these issues for you.
    My experience is different though: https://encode.su/threads/2606-msufs...ll=1#post50476

    > You know very well that you can address these portability problems at
    > compile time or runtime (by checking CPU capabilities), so I do not think
    > it is a real issue.

The issue is that there's no cheap default solution for portable MT in C/C++.
There are basically OpenMP, boost and std::thread, but the first two are too large and bloated,
and std::thread is not reliable on Windows at the moment (neither the VS implementation nor the mingw workarounds).

Sure, with some effort it's possible to create a portable MT library for your project.
But you have to understand that there are plenty of people who have neither the relevant skills
nor the interest in that kind of work.

  18. #17
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
Is there an age limit to join or to receive the contest prize?

I miss something like:
The contest is not open to individuals associated with the contest or with access to the non-public contest data, including employees, agents or representatives, suppliers providing prizes, the immediate family members of the excluded individuals, and all other persons with whom the excluded individuals reside.

  19. #18
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
To give everybody fair play, here is the text that abruptly ends in the public Test 1 file TS40.txt:

    "“What the devil do you mean by that?” cried James Clareborough, savagely, as he tried to reopen the door, b"

This is how it probably was meant to end:

    "ut his brother placed his back to it and held him off."

    http://www.gutenberg.org/files/32881...-h/32881-h.htm

  20. #19
    Member
    Join Date
    Dec 2019
    Location
    Japan
    Posts
    3
    Thanks
    2
    Thanked 0 Times in 0 Posts
    Troll Entry:

    Test hardware bandwidth speed?
    Upload -> URL Output (Compressed Data) -> Download (Uncompressed Data)

    Pardon English

  21. #20
    Member
    Join Date
    May 2017
    Location
    UK
    Posts
    13
    Thanks
    1
    Thanked 3 Times in 3 Posts
Why is there no scientific data, e.g. floats and doubles, in the test set?

Is this the 1990s all over again? With the advent of machine learning and huge analytics and optimization data sets in engineering, this is the biggest challenge we have, not squeezing an extra 0.01% out of Wikipedia compression. At the moment, hardly any compressors work well (and fast) with floating-point data, the exception being Oodle.

  22. #21
    Member Alexander Rhatushnyak's Avatar
    Join Date
    Oct 2007
    Location
    Canada
    Posts
    250
    Thanks
    46
    Thanked 105 Times in 54 Posts
    Quote Originally Posted by djubik View Post
Why is there no scientific data, e.g. floats and doubles, in the test set?
    There are four test sets, and two of them contain some floats and doubles.
    As for the Quantitative Data set, yes, this year it contains only integers,
    but I believe next year there will be doubles for sure, maybe floats too.
No surprise that you have nothing against English-only text in the Qualitative Data set.

    This newsgroup is dedicated to image compression:
    http://linkedin.com/groups/Image-Compression-3363256

  23. #22
    Member
    Join Date
    May 2008
    Location
    France
    Posts
    83
    Thanks
    547
    Thanked 27 Times in 19 Posts
    First change:

    Quote Originally Posted by Rules 2020-06-17.1653
Since we focus on universal approaches, we exclude compressed data because processing it efficiently may require specialized decompression and recompression. Also, we use delta-N for 8 KiB-long blocks, where N varies within {1…8, 12, 16}, to determine whether the estimated fraction of quantitative data is big enough, on the basis of block compressibility after delta-N.
    >
    Quote Originally Posted by Rules 2020-06-24.1143
    Since we focus on universal approaches, we exclude compressed data because processing it efficiently may require specialized decompression and recompression. Also, we filter data to ensure that the test contains comparable estimated percentages of quantitative and qualitative data.
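For anyone curious what the removed delta-N wording refers to, here is a rough sketch of one way such a probe could work, per my reading of the old text (zlib is just an arbitrary probe compressor here, not the organizers' actual tool):

Code:
# Sketch: probe whether an 8 KiB block looks "quantitative" by measuring how
# well it compresses after a byte-wise delta with stride N, N in {1..8, 12, 16}.
import zlib

STRIDES = list(range(1, 9)) + [12, 16]

def delta_n(block, n):
    # replace each byte with its difference (mod 256) from the byte n positions back
    out = bytearray(block)
    for i in range(len(block) - 1, n - 1, -1):
        out[i] = (block[i] - block[i - n]) & 0xFF
    return bytes(out)

def best_delta_ratio(block):
    # stride giving the smallest zlib-compressed size, and that ratio
    return min(((n, len(zlib.compress(delta_n(block, n))) / len(block)) for n in STRIDES),
               key=lambda t: t[1])

# Synthetic example: 8 KiB of 16-bit little-endian ramp data; stride 2 should win.
block = b"".join((i % 65536).to_bytes(2, "little") for i in range(4096))
print(best_delta_ratio(block))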

  24. #23
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    >Note that we use and will use a RAM disk for file I/O.

How much memory is used by the RAM disk?

  25. #24
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    107
    Thanks
    37
    Thanked 29 Times in 20 Posts
    Quote Originally Posted by Sportman View Post
    >Note that we use and will use a RAM disk for file I/O.

How much memory is used by the RAM disk?
The RAM disk is dynamically allocated, so it will use input size + output size, assuming nothing else is on the RAM disk.

  26. #25
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by algorithm View Post
The RAM disk is dynamically allocated.
    2GB Windows 10 + 2GB RAM disk = <28GB left.

  27. #26
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by Shelwien View Post
I have a utility like that for Windows.
It loads an executable in debug mode and tracks memory usage
in the handler of debug events.
It's basically what would be needed to measure memory usage the way you suggest.

The problem is that it affects processing speed, so the memory usage check
has to be done in a separate run.
But 4000s is about 1.1 hours, so testing one program with 20 compression levels
can easily take a day on its own.

In short, we discussed it, but Maxim thinks that memory checks
would add too much load for him.
If we have enough machine time, we will check memory consumption for reference. If not, then not.
At the moment it's hard to forecast.

  28. #27
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by Alexander Rhatushnyak View Post
    No surprise that you have nothing against only English in the Qualitative Data set.
Well, I was against having only English.

  29. #28
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by Noragami View Post
    Troll Entry:

    Test hardware bandwidth speed?
    Upload -> URL Output (Compressed Data) -> Download (Uncompressed Data)
    Excuse me, what do you mean?

  30. #29
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by lz77 View Post
    Will you publish every month best f for rapid compression (f=c_time+A×d_time+B×c_full_size)?
    Yes, it is supposed to happen.

  31. Thanks:

    lz77 (28th June 2020)

  32. #30
    Member
    Join Date
    Apr 2020
    Location
    Russia
    Posts
    25
    Thanks
    0
    Thanked 23 Times in 8 Posts
    Quote Originally Posted by Sportman View Post
This is how it probably was meant to end:

    "ut his brother placed his back to it and held him off."

    http://www.gutenberg.org/files/32881...-h/32881-h.htm
    Quite likely.

    Quote Originally Posted by Sportman View Post
Is there an age limit to join or to receive the contest prize?

    I miss something like:
Formally, there is the "Members of the competition committee are ineligible for awards" condition.
Practically, I think it's very unlikely that the persons involved will be doing anything like what you hinted at.
