
Thread: Calgary challenge

  1. #1
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts

    Calgary challenge

    Anybody notice this? http://mailcom.com/challenge/
    It was posted about 2 months ago.

    Alexander Ratushnyak has a new entry of 580,170 bytes. I tested it and it decompresses to the Calgary corpus in 1054 sec. on a 2 GHz T3200 using about 580 MB of memory (g++ 4.5.0 -O2). The source code looks like a modified PAQ8 with a small dictionary for text encoding.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,373
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Didn't notice, thanks.
    I'd appreciate some more specific info though.
    Like what kind of models it uses (beside text ones), compressed sizes of CC files, etc.

  3. #3
    Administrator Shelwien's Avatar
    Code:
               paq8px69  [589862]  [580170]  
    book1        191594   183296    181684
    book2        116980   112705    110762
    paper1        10314    10145      9935
    paper2        16633    16067     15890
    news          82568    82017     79665
    bib           18261    18702     17935
    trans          9785    10302      9808
    progc          8196     8366      8055
    progl          9399     9780      9316
    progp          6660     6791      6520
    obj1           7183     7507      7337
    obj2          43865    44865     43805
    geo           44283    44101     43739
    pic           30825    21796     22129
    So it looks like it's a text model improvement again (although that's cool too).
    But I have to note that the best known result for geo is 40047 and for pic it's 21535,
    and the decoder source is not completely obfuscated, so I wonder if we'd see 570k too,
    eventually.
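As a rough check on the 570k remark, the sketch below substitutes the best known geo and pic results quoted above into the 580,170-byte entry. It only does the arithmetic and assumes everything else (the other files, the decoder source) stays the same size, which is a simplification.

```python
# Sizes in bytes for the 580,170-byte entry, copied from the table above.
entry = {"geo": 43739, "pic": 22129}

# Best known results quoted in the post.
best_known = {"geo": 40047, "pic": 21535}

TOTAL = 580170  # size of the whole entry


def projected_total(total, sizes, replacements):
    """Return (new total, bytes saved) if the listed files reached the given sizes."""
    saving = sum(sizes[name] - new_size for name, new_size in replacements.items())
    return total - saving, saving


projected, saving = projected_total(TOTAL, entry, best_known)
print(saving)     # 4286
print(projected)  # 575884
```

Under those assumptions geo and pic alone would bring the entry to roughly 575.9k, so reaching 570k would need further gains elsewhere, for example from a smaller decoder source.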

  4. #4
    Expert
    Matt Mahoney's Avatar
    Actually there are 2 separate streams for the first 5 files. The second stream is 5702 bytes which is appended to the end of the compressed file. The program reads this into an array bb[5702] at the beginning, pointed to by sp. I'm not sure what it is used for, but I guess it might be end of line encoding in a separate data stream. Here are the compressed sizes from both streams not including the 183 byte plain text header. So really book1 is 184573 bytes.

    Code:
    book1   181688  2885
    book2   110762  2110
    paper1  9935    81
    paper2  15890   157
    news    79665   469
    bib     17935   0
    trans   9808    0
    progc   8055    0
    progl   9316    0
    progp   6520    0
    obj1    7337    0
    obj2    43805   0
    geo     43739   0
    pic     22129   0
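The figures in the table can be cross-checked with a few lines of arithmetic: the second-stream sizes should add up to the 5702 bytes read into bb, and the two book1 streams should add up to the 184573 bytes quoted above. The dictionary below simply copies the table; the variable names are illustrative.

```python
# (main stream, second stream) sizes in bytes, copied from the table above.
streams = {
    "book1": (181688, 2885),
    "book2": (110762, 2110),
    "paper1": (9935, 81),
    "paper2": (15890, 157),
    "news": (79665, 469),
    "bib": (17935, 0),
    "trans": (9808, 0),
    "progc": (8055, 0),
    "progl": (9316, 0),
    "progp": (6520, 0),
    "obj1": (7337, 0),
    "obj2": (43805, 0),
    "geo": (43739, 0),
    "pic": (22129, 0),
}

# Total size of the appended second stream.
second_total = sum(second for _, second in streams.values())

# Real cost of book1 once both streams are counted.
book1_total = sum(streams["book1"])

print(second_total)  # 5702
print(book1_total)   # 184573
```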
    Also, from the code it looks like it uses dictionary coding (1 byte codes) and capitalization symbols, both for single chars and for whole words.
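To make the idea concrete, here is a minimal sketch of a text transform with 1 byte dictionary codes and two capitalization symbols. It is not taken from the entry: the word list, the code values (0x80 and up), the escape bytes 0x01 and 0x02, and the function names are all assumptions, and the input is assumed to be plain ASCII that does not already contain those bytes.

```python
import re

DICT = ["the", "and", "of", "to"]              # hypothetical tiny dictionary
CODE = {w: 0x80 + i for i, w in enumerate(DICT)}
CAP_FIRST, CAP_ALL = 0x01, 0x02                # hypothetical escape symbols


def is_letter(b):
    """True for ASCII letters A-Z and a-z."""
    return 65 <= b <= 90 or 97 <= b <= 122


def encode(text):
    out = bytearray()
    for m in re.finditer(r"([A-Za-z]+)|[^A-Za-z]", text):
        word = m.group(1)
        if word is None:                       # punctuation, digits, whitespace
            out += m.group(0).encode("ascii")
            continue
        if word.islower():
            pass                               # no flag needed
        elif word[0].isupper() and (len(word) == 1 or word[1:].islower()):
            out.append(CAP_FIRST)              # "The" -> flag + "the"
            word = word.lower()
        elif word.isupper():
            out.append(CAP_ALL)                # "THE" -> flag + "the"
            word = word.lower()
        # any other mixed-case word is left as raw letters
        if word in CODE:
            out.append(CODE[word])             # 1 byte dictionary code
        else:
            out += word.encode("ascii")
    return bytes(out)


def decode(data):
    out, mode, i = [], 0, 0
    while i < len(data):
        b = data[i]
        if b in (CAP_FIRST, CAP_ALL):          # remember flag for the next word
            mode, i = b, i + 1
            continue
        if b >= 0x80:                          # dictionary word
            word, i = DICT[b - 0x80], i + 1
        elif is_letter(b):                     # raw run of letters
            j = i
            while j < len(data) and is_letter(data[j]):
                j += 1
            word, i = data[i:j].decode("ascii"), j
        else:                                  # any other single character
            out.append(chr(b))
            i += 1
            continue
        if mode == CAP_FIRST:
            word = word[0].upper() + word[1:]
        elif mode == CAP_ALL:
            word = word.upper()
        mode = 0
        out.append(word)
    return "".join(out)


sample = "The END of the story, and THE rest to NASA."
assert decode(encode(sample)) == sample
```

Lowercasing words before the dictionary lookup means "the", "The" and "THE" all share one dictionary entry, which is the usual motivation for separate capitalization symbols.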

    I believe that JBIG2 compresses PIC better than any PAQ version. Sample 5 from http://digit.nkp.cz/knihcin/digit/va...T_samples.html is actually PIC compressed to 18,230 bytes with DjVu. The compressor extracts a set of fonts from the image and then replaces character images with font codes plus bit differences. Looking at the code, picModel is unchanged from the one in earlier versions of paq8 (maybe paq8l), which was later removed from paq8px.
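The font substitution idea described above can be sketched as follows. This is not the JBIG2 or DjVu codec, only an illustration of "font code plus bit differences": each glyph is assumed to be a fixed-size bitmap packed into a Python int, and the `max_diff` threshold and function names are hypothetical. A real coder would also have to code glyph positions and entropy-code the output.

```python
def hamming(a, b):
    """Number of differing pixels between two bitmaps packed as ints."""
    return bin(a ^ b).count("1")


def encode_glyphs(glyphs, max_diff=2):
    fonts = []   # extracted "font": one prototype bitmap per distinct shape
    codes = []   # per glyph: (font index, XOR residual against the prototype)
    for g in glyphs:
        best = min(range(len(fonts)),
                   key=lambda i: hamming(fonts[i], g),
                   default=None)
        if best is None or hamming(fonts[best], g) > max_diff:
            fonts.append(g)                  # no close match: new prototype
            best = len(fonts) - 1
        codes.append((best, fonts[best] ^ g))
    return fonts, codes


def decode_glyphs(fonts, codes):
    # XOR with the residual restores the exact original bitmap (lossless).
    return [fonts[i] ^ residual for i, residual in codes]


# Toy 3x3 glyphs packed row by row into 9-bit ints.
A = 0b111101111
A_NOISY = 0b111101110   # same shape with one flipped pixel
B = 0b000010000

glyphs = [A, B, A_NOISY, B]
fonts, codes = encode_glyphs(glyphs)
print(fonts == [A, B])                       # True
print(codes)                                 # [(0, 0), (1, 0), (0, 1), (1, 0)]
print(decode_glyphs(fonts, codes) == glyphs) # True
```

Most residuals are zero or nearly zero, so they cost little to code; the gain comes from repeated characters sharing one prototype.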

    There are some obvious changes such as removing useless models like exe, bmp, wav, jpeg. But there seem to be lots of new contexts in wordModel. Also, for pic, picModel is turned on and most of the text modeling is turned off. The other models are left on for binary files (like geo) but there is a global flag (tf) that is used as context to indicate the input is text (first 10 files).

  5. #5
    Administrator Shelwien's Avatar
    I know of that pic lookalike, but it's likely a different scan of the same page or something.
    And Shkarin's bcdr compresses it to 18829 bytes without messing with fonts (paq8p to 20194).
    Also, as for the approach itself (detecting letters etc.), in theory you're right, but it likely
    won't work out for the Calgary Challenge - there'd be too much extra source code to compensate for.

  6. #6
    Expert
    Matt Mahoney's Avatar
    You're right. It is not the same image. It looks like a different scan of the same page. f05_200.djvu is 1728 x 2339. pic is 1728 x 2376. Also you can compare the attached portion of f05_200 with the same region of pic from http://mattmahoney.net/dc/dce.html#Section_21 and you can see lots of pixel differences around the edges of the characters and lines. (I created the attached image by using DjVu to convert to bmp, zooming in Windows file viewer and capturing with snipping tool).
    Attached image: pic.png (16.3 KB)

  7. #7
    Administrator Shelwien's Avatar
    Btw, this is Shkarin's coder mentioned above: http://nishi.dreamhosters.com/u/bcdr_sh2.rar
    Original one is http://compression.ru/ds/bcdr.rar but it can't be compiled without some fixes.

    Code:
    513278 - pic.bmp
    505286 - f05_200.bmp
    21616 - pic.bcd
    18990 - f05_200.bcd
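The two .bmp sizes are consistent with the image dimensions given earlier in the thread (1728 x 2376 for pic, 1728 x 2339 for f05_200), assuming both were saved as uncompressed 1 bit per pixel BMP files with a 40-byte BITMAPINFOHEADER and a two-colour palette. That format is an assumption; the sketch below just shows that it reproduces the listed sizes.

```python
def mono_bmp_size(width, height):
    """Expected size in bytes of an uncompressed 1 bpp BMP (assumed layout)."""
    row_bytes = ((width + 31) // 32) * 4   # rows are padded to 4-byte multiples
    header = 14 + 40 + 2 * 4               # file header + info header + 2 palette entries
    return header + row_bytes * height


print(mono_bmp_size(1728, 2376))  # 513278, matches pic.bmp
print(mono_bmp_size(1728, 2339))  # 505286, matches f05_200.bmp
```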
