
Thread: FCM3

  1. #1
    encode (The Founder)
    Join Date: May 2006
    Location: Moscow, Russia
    Posts: 3,979
    Thanks: 376
    Thanked 347 Times in 137 Posts

    FCM3

    OK, a closed-source, release candidate version is here:
    http://encode.su/downloads/fcm3-RC.zip

    Brief description:
    It uses two dynamically mixed order-1 models. The first model uses a regular order-1 context; the second uses, as its context, the char predicted by an order-5 LZP. The LZP model keeps 2^25 characters (32 M).


  2. #2
    Member
    Join Date: May 2008
    Location: Estonia
    Posts: 395
    Thanks: 148
    Thanked 225 Times in 123 Posts

    timer fcm3 c enwik9 enwik9.out
    Kernel Time = 4.546 = 00:00:04.546 = 1%
    User Time = 284.015 = 00:04:44.015 = 93%
    Process Time = 288.562 = 00:04:48.562 = 95%
    Global Time = 302.704 = 00:05:02.704 = 100%

    timer fcm3 d enwik9.out enwik9.res
    Kernel Time = 5.125 = 00:00:05.125 = 1%
    User Time = 266.828 = 00:04:26.828 = 96%
    Process Time = 271.953 = 00:04:31.953 = 98%
    Global Time = 275.156 = 00:04:35.156 = 100%


    Result size: 305,331,431 bytes
    KZo


  3. #3
    encode (The Founder)
    Thanks for testing!

  4. #4
    Moderator
    Join Date: May 2008
    Location: Tristan da Cunha
    Posts: 2,034
    Thanks: 0
    Thanked 4 Times in 4 Posts

    Thanks Ilia!

  5. #5
    encode (The Founder)
    Does anyone have any ideas about further LZP-predictor improvement?

    AFAIK, we have many options:

    1. Use variable orders with context confirmation - i.e.:
    order-8
    order-4
    order-2

    I tested such a thing with FCM - the idea works, but in this case we get just a slight compression gain. Order-8 and higher mostly helps on text files.

    2. Keep a few pointers into the buffer - a la ROLZ - and do backwards matching, selecting the longest match; the char that follows it is our predicted char. BTW, such an approach is used in the LZP layer of UHARC's PPM - although of course UHARC uses match lengths etc. instead of just symbol prediction. Again, this idea was tested, but with no luck - probably due to a poor/incorrect implementation - at least I tried order-2 and hashed order-4 contexts, and generally speaking the thing does not work... Well, I think I should do a "buffer-free" implementation. However, E8/E9 needs a buffer...

    3. With each context, keep a counter of successful predictions, and replace the char only if succ<=0. Such a thing was used by Charles Bloom in his PPMZ. Not tested yet...
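    Idea 3 can be sketched as a slot holding the predicted char plus a small confidence counter. The exact semantics below (cap value, reset on replace) are assumptions; the post only specifies "replace the char only if succ<=0". The point is that one mismatch does not evict a char that has predicted well many times.

```cpp
#include <cstdint>

// One LZP table slot: predicted char plus a confidence counter.
struct Slot { uint8_t ch = 0; int succ = 0; };

void update(Slot& s, uint8_t actual) {
    if (s.ch == actual) {
        if (s.succ < 3) ++s.succ;          // reward a correct prediction (cap assumed)
    } else if (--s.succ <= 0) {
        s.ch = actual;                     // evict only after enough failures
        s.succ = 0;
    }
}
```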


  6. #6
    toffer (Programmer)
    Join Date: May 2008
    Location: Erfurt, Germany
    Posts: 587
    Thanks: 0
    Thanked 0 Times in 0 Posts
    You could try the symbol-ranking idea I advertised in the CM coder thread, since it identifies highly redundant data and allows you to compress a significant amount of data down to almost nothing.

    It doesn't need a dictionary.
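    One common form of symbol ranking (my reading of the idea, not toffer's actual code) keeps the few most recent distinct symbols per context in a move-to-front list; a hit is coded as a short rank, a miss falls back to a literal coder. The list size of 3 is an arbitrary choice here.

```cpp
#include <array>
#include <cstdint>

// Per-context symbol-ranking list with move-to-front update.
struct RankList {
    std::array<uint8_t, 3> r{{0, 0, 0}};   // rank0..rank2

    int find(uint8_t c) const {            // -1 on miss
        for (int i = 0; i < 3; ++i) if (r[i] == c) return i;
        return -1;
    }
    void update(uint8_t c) {
        int i = find(c);
        if (i < 0) i = 2;                  // miss: evict the lowest rank
        for (; i > 0; --i) r[i] = r[i - 1];
        r[0] = c;                          // move the seen symbol to front
    }
};
```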

  7. #7
    Shelwien (Administrator)
    Join Date: May 2008
    Location: Kharkov, Ukraine
    Posts: 3,239
    Thanks: 192
    Thanked 968 Times in 501 Posts
    I also mentioned the idea of SSE flags like (order8[].c==order4[].c) etc.

  8. #8
    toffer (Programmer)
    Yep, I didn't invent it, but I've used it successfully.

    Maybe one could join this using a 2D SSE as a mixer? That would merge the mixing and SSE stages, lowering complexity. But additional SSE after that would help too.

  9. #9
    encode (The Founder)
    Yep, but we're going beyond simplicity here. That said, I also tried keeping two LZP models (a higher- and a lower-order one) and dynamically mixing both along with pure order-1. It worked. But at the moment I'm interested in how we can predict the character more accurately.

  10. #10
    Shelwien (Administrator)
    For accurate predictions you'd need all statistics from all contexts,
    not only a single last occurrence.
    And BTW, stop calling the LES (last encountered symbol, or something)
    "rank0", as that's misleading.
    Rank0 in symbol ranking is a first guess; that's usually a global rank,
    and the most probable symbol doesn't even necessarily hold it
    (of course, the same applies to the LES).

    Edit: As I mentioned, it might be helpful to log the code lengths
    or probabilities from e.g. paq8 and first check the cases where
    your model has its relatively worst predictions.
    Last edited by Shelwien; 16th June 2008 at 22:44.

  11. #11
    encode (The Founder)
    Tested the idea with flags - i.e. instead of mixing, encode a flag: predicted/mispredicted. Mispredicted symbols are coded with a pure order-1 coder. Well, the compressor became lightning fast! In most cases we lose some compression, though not too much, and in some cases we even gain some...
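    The structure of this flag scheme can be demonstrated with a lossless split/join pair. This is only a sketch: entropy coding is omitted, and a trivial "last byte repeats" predictor stands in for the LZP model. Per symbol, one hit/miss flag is emitted; only misses additionally emit the literal byte for the order-1 coder.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Flags plus the literals needed for mispredicted positions.
struct FlagStream { std::vector<bool> flags; std::vector<uint8_t> lits; };

static uint8_t predict(uint8_t prev) { return prev; }  // toy stand-in for LZP

FlagStream split(const std::string& in) {
    FlagStream out; uint8_t prev = 0;
    for (uint8_t c : in) {
        bool hit = (c == predict(prev));
        out.flags.push_back(hit);          // flag: predicted / mispredicted
        if (!hit) out.lits.push_back(c);   // literal only on a miss
        prev = c;
    }
    return out;
}

std::string join(const FlagStream& s) {    // exact inverse of split
    std::string out; uint8_t prev = 0; size_t li = 0;
    for (bool hit : s.flags) {
        uint8_t c = hit ? predict(prev) : s.lits[li++];
        out.push_back((char)c);
        prev = c;
    }
    return out;
}
```

    In the real coder, the flag would be an adaptive binary symbol and the literals would go through the order-1 arithmetic coder; that is exactly why it is fast, since most symbols cost only one cheap flag.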

  12. #12
    Shelwien (Administrator)
    Well, that's unary coding.
    But now you probably have some redundancy due to the possibility of encoding
    the same symbol via two different intervals.
    Well, paq9a does it the same way, so you're not alone.

    To remove this redundancy you have to either merge the symbol's probability
    from the byte distribution into the flag probability (p_flag += (1-p_flag)*p[cc])
    or just discard it. And either way you have to mask it in the distribution and
    renormalize that... or at least skip encoding the last bit of the value cc^1.
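    The two fixes above, in code form. The merge formula is taken verbatim from the post; the masking/renormalization step is my reading of "mask it in the distribution and renormalize", shown here for a single symbol probability rather than a whole distribution.

```cpp
// Fold the predicted symbol cc's byte-distribution probability p_cc
// into the flag probability (formula from the post).
double merged_flag(double p_flag, double p_cc) {
    return p_flag + (1.0 - p_flag) * p_cc;
}

// After a "mispredicted" flag, cc is impossible, so any other symbol's
// probability p_s must be renormalized over the distribution without cc.
double masked(double p_s, double p_cc) {
    return p_s / (1.0 - p_cc);
}
```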

  13. #13
    encode (The Founder)
    Theoretically, we could use a data structure like PPMd's to access very high orders, and just traverse backwards outputting flags - much like PPM's escape flag.

    One thing I still want to try is the backwards-matching idea, properly implemented... I think it should work, and at the same time it would give us access to unlimited context lengths.

  14. #14
    toffer (Programmer)
    Unlimited context lengths -> PPM*. I'm not sure, but somewhere I read about PPM implementations doing this (backward matching).

  15. #15
    Shelwien (Administrator)
    There's no reason to, as it's much faster to access the suffix from the longest context.
    E.g., if you have the context "abc" and didn't find your symbol in it, you
    can switch to context "bc" using a pointer stored in the "abc" node.
    Then, if you find your symbol in "bc" (say "d"), you already
    have a pointer to the next context "bcd".
    Thus, the context order can either be decreased by following the suffix link
    or increased by following a symbol link.

    While your "backward matching", if I understood it right, requires
    parsing the whole tree for each symbol, or computing a hash function
    per order for each symbol.

  16. #16
    encode (The Founder)
    My idea, which I'm implementing right now:

    Keep a buffer of the input data.

    Keep a few pointers into the buffer for the same context/hash - a la ROLZ.

    Check each buffer position for the longest backward match.

    Grab the longest one; the char that follows it is our predicted char.

    If we guessed correctly, emit a "predicted" flag.

    If we mispredicted, emit a "mispredicted" flag plus the actual order-1 coded char.

    Wait an hour - I'll post some testing results...
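    The steps above can be sketched as follows. Parameters (offset list size, match-length cap, an exact order-2 map instead of a hash) are assumptions for illustration; the entropy-coding part is covered by the earlier flag scheme and omitted here.

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include <unordered_map>

// Backward-matching predictor: for the current order-2 context, keep a
// few previous positions, match backwards from each, and predict the
// char that followed the longest match.
struct BackMatch {
    std::string buf;                                   // input so far
    std::unordered_map<uint16_t, std::vector<size_t>> offs;
    static constexpr size_t MAXOFF = 16, MAXLEN = 64;  // assumed limits

    int predict() const {                              // -1 = no prediction
        if (buf.size() < 2) return -1;
        uint16_t ctx = (uint8_t)buf[buf.size()-2] << 8 | (uint8_t)buf.back();
        auto it = offs.find(ctx);
        if (it == offs.end()) return -1;
        size_t bestlen = 0; int best = -1;
        for (size_t p : it->second) {                  // p: char after ctx
            size_t len = 0;                            // backward match length
            while (len < MAXLEN && len < p
                   && buf[p-1-len] == buf[buf.size()-1-len]) ++len;
            if (len >= bestlen) { bestlen = len; best = (uint8_t)buf[p]; }
        }
        return best;
    }
    void update(uint8_t c) {
        buf.push_back((char)c);
        size_t i = buf.size() - 1;
        if (i >= 2) {                                  // c follows buf[i-2..i-1]
            uint16_t ctx = (uint8_t)buf[i-2] << 8 | (uint8_t)buf[i-1];
            auto& v = offs[ctx];
            v.push_back(i);
            if (v.size() > MAXOFF) v.erase(v.begin()); // drop oldest offset
        }
    }
};
```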

  17. #17
    encode (The Founder)
    DONE!

    Well, generally speaking, this idea does not work.

    Check out some testing results:

    order-2
    Using an order-2 context with no hashing. The backward match length was not limited, so on long repeated byte sequences the compressor freezes; in practice we could limit the match length to some large number. Results are listed for a varying number of stored offsets.

    world95.txt:
    256: 782,667 bytes
    512: 740,807 bytes
    1k: 710,514 bytes

    calgary.tar:
    256: 996,185 bytes
    512: 978,428 bytes
    1k: 968,160 bytes

    fp.log:
    256: 1,039,541 bytes
    512: 988,773 bytes
    1k: 964,333 bytes

    order-4
    Just for testing purposes, a hashed order-4 (20-bit hash) with different numbers of stored offsets, as before.

    world95.txt:
    16: 768,512 bytes
    32: 739,743 bytes
    64: 717,453 bytes

    calgary.tar:
    16: 1,018,008 bytes
    32: 1,015,644 bytes
    64: 1,011,131 bytes

    fp.log:
    16: 990,640 bytes
    32: 974,222 bytes
    64: 961,020 bytes

    Mostly, order-2 compresses better, but it is slower than the higher orders, since we have a larger number of offsets to check.

    So anyway, a straight ROLZ scheme is better...
