
Thread: Hutter Prize Question

  1. #1 Bullbash (Member)

    Hi Everybody,

    I'm not in compression myself, more in pattern mining... mercy, please.

    Would something like a list of high-frequency skip-n-grams be helpful in text (or whatever) compression? (enwik9 frequencies below; see the attached image)

    3040805 ~ and ~ ~
    1407638 ~ ~ ~ id
    1245093 ~ of the ~
    551192 the ~ ~ the
    489215 ~ revision ~ ~
    487628 ~ ~ contributor ~
    412516 at ~ ~ ~
    410954 username ~ ~ ~
    369067 ~ ~ ~ &amp
    361386 ~ ~ ~ have
    356645 ~ which ~ ~
    332079 ~ sup2 ~ ~
    293918 title ~ ~ id
    261460 the ~ of the
    244682 page ~ title ~
    **********************
    10116 [[Image ~ of ~
    10092 ~ space ~ The
    10089 ~ y ~ ~
    10074 7 mi&amp sup2 ~
    10050 in the city is
    10042 ~ nearly ~ ~
    10035 ~ ~ by his
    10032 ~ ~ actress ~
    10019 used by ~ ~

    Against the rules, I guess - just wondering... There are ~11K 4-word grams appearing more than 10K times in enwik9.
    I can mine longer ones, though the patterns would be less frequent.
    Call it a domain-specific dictionary.
    Thank you!
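
    Just to make the idea concrete, here is a minimal sketch of how such a table could be mined (illustration only, not the tool actually used for the list above; '~' marks a wildcard slot):

    Code:
    # Sketch only: counts every 4-token window of whitespace-split text with
    # each subset of its slots replaced by the '~' wildcard.
    # Note: holding all counts for enwik9 in RAM like this is unrealistic;
    # a real miner would need pruning or external sorting.
    from collections import Counter
    from itertools import product

    def skip_4grams(tokens):
        for i in range(len(tokens) - 3):
            window = tokens[i:i + 4]
            for mask in product((True, False), repeat=4):
                if not any(mask):          # skip the all-wildcard pattern
                    continue
                yield tuple(w if keep else '~' for w, keep in zip(window, mask))

    def frequent_patterns(text, min_count=10000):
        counts = Counter(skip_4grams(text.split()))
        return [(c, ' '.join(p)) for p, c in counts.most_common() if c >= min_count]

    # usage:
    # for count, pattern in frequent_patterns(open('enwik9', errors='ignore').read()):
    #     print(count, pattern)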

  2. #2 Shelwien (Administrator)
    There are some contexts like that in paq8px: https://github.com/hxim/paq8px/blob/.../Info.cpp#L367
    But that's an adaptive model - it's used for prediction while it's still training.

    If you want to store that kind of frequency table - sure, that's still possible,
    but the size of this table would still be counted as part of the final result (along with the decoder executable size).

    Then there's also the question of how exactly you would make use of these patterns.
    Normal (serial) encoding can only use a part of this information (when the (n-1)-gram says the next word has to be this one).
    We could use one word of context, then enumerate all combinations of (n-1) words ahead, exclude impossible combinations,
    then sum up the probabilities - but that would be crazy slow.
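
    As a rough illustration of the "serial" case (a sketch, not paq8px code): with one word of left context, the table can only contribute to the next-word prediction through patterns whose second slot is concrete; the look-ahead slots are effectively marginalized away.

    Code:
    # Sketch: `table` maps 4-tuples such as ('the', '~', 'of', 'the') to counts.
    # Only patterns starting with the previous word and having a concrete word
    # in the second slot say anything about the *next* word; the two
    # look-ahead slots are simply summed over.
    from collections import defaultdict

    def next_word_scores(table, prev_word):
        scores = defaultdict(int)
        for (w0, w1, _w2, _w3), count in table.items():
            if w0 == prev_word and w1 != '~':
                scores[w1] += count
        total = sum(scores.values())
        return {w: c / total for w, c in scores.items()} if total else {}

    # These scores would then have to be mixed with a real word model's
    # prediction; using the full look-ahead information instead would mean
    # enumerating word combinations, which is the "crazy slow" part.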

  3. #3 Bullbash (Member)
    Thank you. Besides being "crazy slow" - could you do something like Huffman coding? If you replace just one *global* pattern of 10 bytes occurring 100K times with something [very] short, it would remove ~1 MB of that enwik file. Do it twice and you get an improvement of ~2% (OK, less in reality). Compress the rest as you magicians usually do, and you beat the Hutter Prize. And I can give you around 1K of such useful patterns... The big idea is to remove the frequent patterns, compress the dictionary, compress the pattern locations, and compress the remainder - all separately. Does that make sense or not?
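
    To pin down the kind of transform being proposed, here is a naive sketch (the escape-byte scheme is a simplification for illustration, not a concrete proposal):

    Code:
    # Sketch of naive dictionary substitution: pattern i becomes the two-byte
    # escape sequence (ESC, i). Assumes ESC never occurs in the data and that
    # there are at most 256 patterns.
    ESC = 0x01

    def substitute(data: bytes, patterns: list) -> bytes:
        assert bytes([ESC]) not in data and len(patterns) <= 256
        for i, pat in enumerate(patterns):
            data = data.replace(pat, bytes([ESC, i]))
        return data

    def desubstitute(data: bytes, patterns: list) -> bytes:
        out = bytearray()
        i = 0
        while i < len(data):
            if data[i] == ESC:
                out += patterns[data[i + 1]]
                i += 2
            else:
                out.append(data[i])
                i += 1
        return bytes(out)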

    Forget about "slow" for a sec - remember the origins of the competition... Marcus Hutter is/was looking for a model of human knowledge representation. Humans learn slowly...

  4. #4 Shelwien (Administrator)
    Unfortunately it doesn't work like that, and we're like 50 years past the stage of Huffman coding.
    Compression-wise it would _always_ be better to just use arithmetic coding instead of Huffman.
    And reducing the size of the input data won't necessarily allow for better compression.
    Consider this example:
    Code:
    768,771 BOOK1
    261,220 BOOK1.7z
    216,021 BOOK1.pmd
    184,121 BOOK1.paq8px197
    
    643,220 BOOK1_no_space
    252,976 BOOK1_no_space.7z
    219,554 BOOK1_no_space.pmd
    199,364 BOOK1_no_space.paq8px197
    BOOK1 is a plaintext English book. I removed the spaces from it and compressed both the original and the no-space version with some compressors.
    .7z uses LZ77 compression (LZMA2 to be specific), which has no real context model, so it behaves as expected - fewer bytes on input, fewer on output.
    But the simple PPM statistical model in .pmd adds redundancy instead, and the complex model in paq8 adds even more.
    This happens because the spaces were actually useful as context, and removing them actually added noise to the information contained in the file.
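
    For anyone who wants to reproduce this kind of test, a rough sketch (a reconstruction, not the exact commands used above):

    Code:
    # Sketch: strip the spaces from BOOK1 and keep their original positions
    # separately, then compress the no-space file (and, for a fair total,
    # the position list) with the compressors above.
    def remove_spaces(text):
        positions = [i for i, ch in enumerate(text) if ch == ' ']
        return text.replace(' ', ''), positions

    def restore_spaces(no_space, positions):
        out = list(no_space)
        for p in positions:        # positions are indices in the original text,
            out.insert(p, ' ')     # so inserting in ascending order is correct
        return ''.join(out)

    # text = open('BOOK1', encoding='latin-1').read()
    # stripped, positions = remove_spaces(text)
    # open('BOOK1_no_space', 'w', encoding='latin-1').write(stripped)
    # assert restore_spaces(stripped, positions) == text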

    I'm quite sure that similar stuff would happen if we tried to naively replace word patterns with some codes
    (or even just removed the patterns from the main file and added the necessary information to some secondary file).
    Removing words from text breaks all kinds of structures, so a statistical model would mispredict the "future" data at the point of replacement,
    and would then apply incorrect statistics at all following occurrences of "breakpoint contexts".

    Beneficial replacements are actually possible, but you have to be careful to preserve the structure of the data
    (at least the types of structure visible to the compressor's model).
    https://encode.su/threads/3072-contr...p-tools-for-HP

    I suppose making a model which would take the "breakpoints" into account and work around them
    should also be possible, but it's like deciding to add "encryption breaking" when trying to write a compressor -
    sure, this feature can be useful in some cases (including better compression of encrypted files),
    but normally it is simply not relevant to the main purpose, and it makes the programming 10x harder.

  5. #5 Bullbash (Member)
    "Being doing some thinking" (c).

    Removing a single symbol (space) is not a good idea at all, for two reasons:
    a) "spaces were actually useful as context, and removing them actually added noise to the information contained in the file" - agree
    b) a table of space locations would be too big (I guess) - even compressed.

    But removal of a bigger pattern might make sense:

    Code:
    Sources:

    enwik9.txt                                                    - 1,000,000,000
    enwik9 (without "<contributor>" and "</contributor>")         -   993,427,498 (- ~6.5 MB)
    enwik9 (without "<*contributor>" and surrounding whitespace)  -   990,019,584 (- ~10 MB)

    paq8pxd_v5 compression:

    146,118,667
    146,127,114 (bigger by ~8K)
    146,111,209 (smaller by ~7K)
    
    
    Though the locations table would still be bigger than the ~7K saved, there is a trend: salient/complex patterns could help in compression.
    If I wanted it badly enough, I'd find a few which could actually make a smaller archive.
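
    One quick way to sanity-check that, assuming the removed-tag offsets are known (a sketch, not what was measured above): delta-encode the offsets, compress them, and compare against the ~7K gained.

    Code:
    # Sketch: rough upper bound on the compressed size of a location table,
    # using delta-encoded 32-bit offsets and zlib. A context-mixing coder
    # would do better, but this already shows the order of magnitude.
    import struct, zlib

    def location_table_size(offsets):
        deltas = [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]
        raw = b''.join(struct.pack('<I', d) for d in deltas)
        return len(zlib.compress(raw, 9))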

    The point: mining complex patterns is an NP-hard problem; I'm just looking for a good approximation algorithm.
    It does "make the programming 10x harder", and it is beyond the Hutter competition rules.

    I do believe that human memory is built as a hierarchy of bigger and bigger patterns - which is another story.

    Peace!
    Attached image: 1.png

  6. #6 Shelwien (Administrator)
    > b) a table of space locations would be too big (I guess) - even compressed.

    That's actually not a problem, since the space flags can be compressed in the context
    of the main file - it can actually improve compression of the spaces (compared to compressing them inline),
    since in this case it's possible to also use "future" right-side contexts.
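
    A small sketch of what "in the context of the main file" can mean here (illustration only, assuming single spaces): code one flag per gap of the no-space text, with the characters on both sides of the gap as context - the right-side character is exactly the "future" context an inline coder would not have.

    Code:
    # Sketch: estimated cost (in bytes) of the space flags when each flag is
    # coded with the characters on both sides of the gap as context, using
    # simple adaptive Laplace-smoothed counts. Assumes single spaces only.
    import math
    from collections import defaultdict

    def space_flag_cost(text):
        chars = [c for c in text if c != ' ']
        flags = []                       # flags[i]: is there a space after chars[i]?
        for i, c in enumerate(text[:-1]):
            if c != ' ':
                flags.append(1 if text[i + 1] == ' ' else 0)
        counts = defaultdict(lambda: [1, 1])   # per-context counts of [flag 0, flag 1]
        bits = 0.0
        for i, f in enumerate(flags):
            right = chars[i + 1] if i + 1 < len(chars) else ''
            ctx = (chars[i], right)             # left char + "future" right char
            n0, n1 = counts[ctx]
            bits -= math.log2((n1 if f else n0) / (n0 + n1))
            counts[ctx][f] += 1
        return bits / 8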

    > But removal of a bigger pattern might make sense

    I agree, and I made a whole project about that, linked in my previous post.
    However, you also have to understand that enwik has an XML index, which can be
    seen at the start of the file, so people frequently start by targeting it.
    But this index actually compresses very well even as-is, and compresses even
    better with dedicated preprocessors/generators:

    https://encode.su/threads/2590-Some-...enwik8-parsing
    https://encode.su/threads/2590-Some-...ll=1#post59399
    https://encode.su/threads/3082-how-t...numbers-better
    https://encode.su/threads/2924-Faste...ssion-(for-fun)

    The actual main issues are:
    1) wiki links like [[x]] and [[x|y]] - they're sometimes placed in weird ways (e.g. mega[[joule]]s),
    so it's not enough to just encode a single flag per word (see the sketch after this list).
    2) bibliography and reference sections
    3) actual English text
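
    For issue 1, something along these lines (a hypothetical sketch; the example string is made up) captures a link together with the word fragments glued onto it, so it can be treated as one token instead of one flag per whitespace-split word:

    Code:
    # Sketch: match [[target]] / [[target|label]] together with any word
    # fragments fused onto it, e.g. "mega[[joule]]s".
    import re

    WIKILINK = re.compile(
        r'(\w*)'               # prefix fused to the link ("mega"), may be empty
        r'\[\[([^\[\]|]+)'     # link target ("joule")
        r'(?:\|([^\[\]]*))?'   # optional display text after '|'
        r'\]\](\w*)'           # suffix fused to the link ("s"), may be empty
    )

    for m in WIKILINK.finditer('about 7 mega[[joule]]s of [[energy|power]]'):
        print(m.groups())
    # ('mega', 'joule', None, 's')
    # ('', 'energy', 'power', '')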
