Page 1 of 2
Results 1 to 30 of 31

Thread: PREDICTOR algorithm

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Just recalled an old-school algorithm called PREDICTOR.

    Some info can be found here:
    http://rfc.net/rfc1978.html

    The idea is:

Keep a "prediction table" that maps hash -> symbol.

For each input symbol, compare it with the entry stored in the table: on a correct guess, output just one flag bit; on a miss, output the flag bit plus the unguessed byte, and update the table. Just like a one-byte LZP.

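The scheme above can be sketched in C. This is a minimal illustration, not RFC 1978's exact format: the 16-bit table and the shift-XOR hash are assumptions for the example; only the flag-bit-plus-literal framing follows the description.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAB_BITS 16
#define TAB_SIZE (1 << TAB_BITS)

/* Illustrative rolling context hash (table size is an assumption). */
static uint32_t next_hash(uint32_t h, uint8_t c) {
    return ((h << 5) ^ c) & (TAB_SIZE - 1);
}

/* One flag bit per symbol (LSB first within each flag byte):
   1 = table guessed the byte (nothing else stored),
   0 = miss (literal byte follows; table entry is replaced). */
size_t predictor_compress(const uint8_t *in, size_t n, uint8_t *out) {
    static uint8_t tab[TAB_SIZE];
    memset(tab, 0, sizeof tab);
    uint32_t h = 0;
    size_t o = 0, flagpos = 0;
    int bit = 8;                       /* bits used in current flag byte */
    for (size_t i = 0; i < n; i++) {
        if (bit == 8) { flagpos = o++; out[flagpos] = 0; bit = 0; }
        if (tab[h] == in[i]) {
            out[flagpos] |= (uint8_t)(1 << bit);  /* hit: flag bit only */
        } else {
            tab[h] = in[i];
            out[o++] = in[i];                     /* miss: flag + literal */
        }
        bit++;
        h = next_hash(h, in[i]);
    }
    return o;
}

size_t predictor_decompress(const uint8_t *in, uint8_t *out, size_t n) {
    static uint8_t tab[TAB_SIZE];
    memset(tab, 0, sizeof tab);
    uint32_t h = 0;
    size_t i = 0;
    int bit = 8;
    uint8_t flags = 0;
    for (size_t j = 0; j < n; j++) {
        if (bit == 8) { flags = in[i++]; bit = 0; }
        uint8_t c;
        if (flags & (1 << bit)) c = tab[h];       /* guessed from table */
        else { c = in[i++]; tab[h] = c; }         /* literal, update    */
        bit++;
        out[j] = c;
        h = next_hash(h, c);
    }
    return i;
}
```

Worst case the output is the input plus one flag byte per 8 symbols, which is the usual bound for this family of coders.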

  2. #2
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
    http://www.ietf.org/rfc/rfc1978.txt

It works exactly like my TarsaLZP: I have two models for LZP plus one model for literals (unguessed bytes).

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,365
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
That's symbol ranking.

  4. #4
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    ppp ranks #88 on LTCB (just below fpaq0f2). http://cs.fit.edu/~mmahoney/compression/text.html#5793

    I call it LZP. LZP is symbol ranking with a queue size of 1, or if you prefer, ROLZ with a dictionary size of 1.

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,365
    Thanks
    212
    Thanked 1,016 Times in 540 Posts
    > I call it LZP. LZP is symbol ranking with a queue size of 1, or if you prefer,
    > ROLZ with a dictionary size of 1.

Well, LZP's definition is this: http://cbloom.com/src/index_lz.html
So I guess I'm all wrong here.

    But still I'd prefer to call LZP a LZ77 algorithm with offsets (filtered
    by current suffix) and match lengths, because LZ is supposed to be a
    fast asymmetric algorithm, while symbol ranking is symmetric and not so
    fast and has further generalizations.

  6. #6
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    By the way, for ASM lords:

    Code:
     
procedure decompress; 
asm 
  lea esi,c        // esi -> compressed input
  lea edi,buf      // edi -> output buffer
  lea ebp,tab      // ebp -> prediction table (hash -> byte)
  mov ecx,n        // n output bytes to produce
  xor ebx,ebx      // ebx = context hash
  mov dx,1         // dx = flag shift register; 1 = "empty" sentinel
@@1: 
  cmp dx,1         // all 8 flag bits consumed?
  jne @@2 
  lodsb            // fetch the next flag byte into dl
  mov dl,al 
  add dh,1         // dh = 1: sentinel bit; dx reaches 1 again after 8 shifts
@@2: 
  test dx,1        // current flag bit
  jnz @@3 
  lodsb            // miss: read literal and store it into the table
  mov [ebp+ebx],al 
  jmp @@4 
@@3: 
  mov al,[ebp+ebx] // hit: take the predicted byte from the table
@@4: 
  stosb            // emit the byte
  shl ebx,5        // update the 20-bit hash
  xor bl,al 
  and ebx,$fffff 
  shr dx,1         // advance to the next flag bit
  loop @@1 
end;

It's my ASM-driven PREDICTOR. As you see, it's pretty small and neat. Looks like I will add it to the RASH project...

  7. #7
    Member
    Join Date
    Jan 2007
    Location
    Moscow
    Posts
    239
    Thanks
    0
    Thanked 3 Times in 1 Post
    Quote Originally Posted by encode
    By the way, for ASM lords:
They sit on the wasm.ru forum

  8. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Quote Originally Posted by Shelwien
Well, LZP's definition is this: http://cbloom.com/src/index_lz.html
So I guess I'm all wrong here.
No, you're right. It is symbol ranking, unless you want to call it LZP with a fixed match length of 1.

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    BTW I looked at the LZP programs. They all failed on enwik9 (out of memory) but worked on enwik8. The program tries to allocate buffers for input and output all at once.
    http://cs.fit.edu/~mmahoney/compression/text.html#f330

    Also, lzp1 and lzp2 crashed on calgary.tar. lzp3o2 decompressed it correctly, then crashed on exit. No problems on enwik8.

I tried compiling, but there were a lot of problems building crblib with gcc, so I gave up. lzp2 didn't include a decompressor.

I also looked at ppmz2, but it apparently only runs in a test mode.

  10. #10
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
By the way, Matt, your hash function from SR2 can easily fit into PPP-like algorithms.

    h = (h * (5 << 5)) + c + 1;

    5 = 101 in binary notation

    thus:

    h = ((h << 5) + (h << 7)) + c + 1;

Actually, we may eliminate the "+ 1"; in my tests:

    h = (h * (5 << 5)) + c;

    has about the same performance.

    also we may do:

    h = (h * (5 << 5)) ^ c;

This one performs better on binary files. In addition, for binary files we may use an order-3 hash:

    h = (h * (5 << 7)) ^ c;

In this case the table size should be 1<<21 (7+7+7 = 21 bits of context).

    Any ideas about other sliding hashes?
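For reference, the variants above differ only in the mixing step and the mask; a minimal C rendering of the two hashes as described (function names are made up):

```c
#include <assert.h>
#include <stdint.h>

#define ORDER2_MASK 0xFFFFF   /* 1<<20 entries, the text/order-2 variant */
#define ORDER3_MASK 0x1FFFFF  /* 1<<21 entries: 7+7+7 bits of context    */

/* Order-2 variant: h = h*(5<<5) + c, the "+ 1" dropped as tested above. */
static uint32_t hash_o2(uint32_t h, uint8_t c) {
    return (h * (5u << 5) + c) & ORDER2_MASK;
}

/* Order-3 variant for binary data: XOR mixing, 7-bit shift per step. */
static uint32_t hash_o3(uint32_t h, uint8_t c) {
    return ((h * (5u << 7)) ^ c) & ORDER3_MASK;
}
```

Since 5<<5 = 160 has bits set at positions 5 and 7, `h * (5u << 5)` is exactly `(h << 5) + (h << 7)`, which is the shift form shown earlier.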

  11. #11
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Yes, c + 1 is probably not needed. I used it for better hashing of strings of zero bytes.

    g++ does a nice optimization of h * (5 << 5):
    Code:
     
    leal    (%eax,%eax,4), %eax 
sall    $5, %eax

  12. #12
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Intel C++ does the same:
    Code:
     
      lea ebx,[ebx+ebx*4] 
      shl ebx,5
So, I will use this ASM code...

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    ...and VC2005 does it in the same way...

  14. #14
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    I also noticed that no one uses LOOP opcode.

In Delphi sources we can often see:

    DEC ECX
    JNZ @LOOP

In some optimization guidelines we may find something like this:

    SUB ECX,1
    JNZ @LOOP

    But Intel C++ does even smarter:

    ADD ECX,-1
    JNZ @LOOP


  15. #15
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
Why should ADD/SUB be better? ADD/SUB requires an extra immediate operand (in this case).
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  16. #16
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
It's not so simple. We face some hardware-specific stuff (INC/DEC update the flags only partially, and so on - hidden stuff).
    Read this:
    pentopt.pdf


  17. #17
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
Thanks! The last time I read about asm optimization was in the days of the 386/486. To answer my question:

    p. 84
    "For historical reasons, the INC and DEC instructions leave the carry flag unchanged, while
    the other arithmetic flags are written to. This causes a false dependence on the previous
    value of the flags and costs an extra uop. To avoid these problems, it is recommended that
    you always use ADD and SUB instead of INC and DEC. For example, INC EAX should be
    replaced by ADD EAX,1."

BTW: After reading about hashes, I implemented the "predictor" algorithm, and I found an implementation on the net from, say, 1989. It used 12-bit hashes and order-2 contexts.

I made an extremely fast decoder using a jump table (asm): 16 functions (without stack frames) indexed by a group of 4 flags. Function "0000" decodes four unprocessed characters ... function "1111" decodes four rank-0 symbols.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk
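The nibble-indexed dispatch described in the post above could look like this in C, with function pointers standing in for the hand-written asm jump table (all names, the table size, and the hash are illustrative assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TBITS 16

typedef struct {
    const uint8_t *in;        /* compressed stream cursor          */
    uint8_t *out;             /* output cursor                     */
    uint8_t tab[1 << TBITS];  /* prediction table: hash -> symbol  */
    uint32_t h;               /* current context hash              */
} State;

/* Decode one symbol: flag 1 = rank-0 hit from the table,
   flag 0 = literal byte from the stream (table entry replaced). */
static void step(State *s, int flag) {
    uint8_t c = flag ? s->tab[s->h] : *s->in++;
    s->tab[s->h] = c;
    *s->out++ = c;
    s->h = ((s->h << 5) ^ c) & ((1 << TBITS) - 1);
}

/* 16 handlers, one per 4-flag nibble (LSB = first symbol decoded). */
#define GROUP(k) static void g##k(State *s) { \
    step(s, (k) & 1); step(s, ((k) >> 1) & 1); \
    step(s, ((k) >> 2) & 1); step(s, ((k) >> 3) & 1); }
GROUP(0)  GROUP(1)  GROUP(2)  GROUP(3)
GROUP(4)  GROUP(5)  GROUP(6)  GROUP(7)
GROUP(8)  GROUP(9)  GROUP(10) GROUP(11)
GROUP(12) GROUP(13) GROUP(14) GROUP(15)

static void (*const dispatch[16])(State *) = {
    g0, g1, g2,  g3,  g4,  g5,  g6,  g7,
    g8, g9, g10, g11, g12, g13, g14, g15
};

/* Each iteration consumes one flag byte (8 symbols) via two table jumps,
   with no data-dependent conditional branch inside a handler pair. */
void decode_groups(State *s, size_t flag_bytes) {
    while (flag_bytes--) {
        unsigned f = *s->in++;
        dispatch[f & 15](s);
        dispatch[f >> 4](s);
    }
}
```

Within each handler the hit/miss decision still exists, but the per-group control flow is a single indexed call, which is the point of the trick.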

  18. #18
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
And hence, a hash function like:
    Code:
     
  h = (h*(5<<5) + c + 1) & 0xfffff;

    In optimized asm looks like:
    Code:
     
      lea ebx,[ebx+ebx*4] 
      shl ebx,5 
      lea ebx,[ebx+eax+1] 
      and ebx,1048575

The first LEA+SHL pair is equal to:
IMUL EBX,160
The second LEA is equal to:
ADD EBX,EAX
ADD EBX,1


  19. #19
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by toffer
    I made an extremly fast decoder using a jump table (asm): have 16 functions (without stack frame) indexed by a group of 4 flags. function "0000" decodes four unprocessed characters .... function "1111" decodes four rank 0 symbols.
Currently I'm working on the same thing. I wrote a super small and fast decoder in ASM, primarily for my EXE-packer/crypter.

  20. #20
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
I'm aware of simpler things like using LEA and SHx. But my asm knowledge doesn't cover such tricky stuff as the one above (using ADD instead of DEC), since it involves knowledge of newer Pentium instruction handling.

    Quote Originally Posted by encode
Currently I'm working on the same thing. I wrote a super small and fast decoder in ASM, primarily for my EXE-packer/crypter.
I meant that I did this some years ago (when I first read about hashing). Actually, I used JMPs instead of CALL/RET to avoid return address handling. Do you also try to use jump tables?
If you use this approach, you could add (binary) RLE to compress rank-0 runs without a big performance hit, since a decoding function doesn't involve any branches.

At university we once did a fun project and realized some asm tasks using ONLY arithmetic and logic instructions. The program became huge, but the supervisor was confused and couldn't follow how the source worked
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  21. #21
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by toffer
    Do you also try to use jump tables?
    Nope. I think inline procedures and macros are better.

    And here is the test program packed with RASH (No anti-emulation tricks included.):
    test.exe

    This is a small app that displays and moves a large picture.

The stub was written in Delphi; the decompress procedure is written in pure asm. The only reason I use asm here is that I can apply some anti-debug and anti-emulation tricks. In addition, hand-tuned asm code is far faster than Delphi-generated code.


  22. #22
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
My first thought was similar. But using a jump table completely kicks out any branches - in the end, the jump table version turned out to be the fastest one I could get.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk

  23. #23
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
A long time ago, I noticed that Delphi compiles "case (...) of" structures into jump tables. That's cool, isn't it?
    BIT Archiver homepage: www.osmanturan.com

  24. #24
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
This is nothing surprising. You should use a case statement instead of "if"s when your case values x1, x2, ... are contiguous, to allow such a transformation: jump to address[x - xmin]. That's the reason for the existence of case statements.

    Quote Originally Posted by toffer
    kicks out any brances
Don't get this wrong - I mean branches like: got a match? decode the match; otherwise decode a character.
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk
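The transformation described above is easy to demonstrate; with contiguous case values, a compiler can replace the compare chain with one indexed jump through an address table (the function and its return values below are made up for illustration):

```c
/* Contiguous case values 0..3 let the compiler emit a single indexed
   jump, goto address_table[x - 0], instead of a chain of compares. */
int dispatch_kind(int x) {
    switch (x) {
    case 0: return 10;   /* e.g. "decode literal" handler */
    case 1: return 20;   /* e.g. "decode match" handler   */
    case 2: return 30;
    case 3: return 40;
    default: return -1;  /* out of range: no table slot   */
    }
}
```

With a gap in the values (say, cases 0, 1, 2, 100) the compiler falls back to compares or a two-level scheme, which is why contiguity matters.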

  25. #25
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by toffer
    got match? decode match, otherwise decode character.
    Here I use CMP/TEST + JNE/JNZ stuff, no external functions. All ASM code fits in a single procedure.

  26. #26
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    @toffer:
Also, I saw jump tables in the Quake 2 source code. There are 8 different options for calculating the distance between an axis-aligned bounding box and a plane. This must be as fast as possible, because thousands of calculations are needed per second! I think jump tables are very effective. Keep in mind, game algorithms must be much faster than compression algorithms, because games run in real time.
    BIT Archiver homepage: www.osmanturan.com

  27. #27
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
That's fast if the branches are long, the jump table is small (so it fits easily in cache and doesn't occupy much space, i.e. it's not pushed out by other data), there are many branches (> 3), and there's no single branch that's executed very often (with > 50% probability).

If there's one branch with high probability (50% or more), then there should be a normal jump for it, i.e.:
Code:
 
  jcond .second 
.first: 
  { very often executed code } 
  jmp .out 
.second: 
  jmp jumptable[index] 
.out:

A jump table always generates a misprediction. With normal conditional jumps, the processor tries to guess whether the jump will be taken or not (and it guesses pretty well).

Quote Originally Posted by toffer
Actually i used jmps instead of call/ret to avoid return address handling.
The processor has a special internal stack for function return addresses, so it can guess the return address without reading the general stack. If you fool the processor once, it flushes that entire internal stack and can't predict further returns (until the next calls).

Current processors have many mechanisms for predicting simple regularities in data and code flow, so some tricks can fool them and decrease performance.

Optimization guidelines are here:
http://agner.org/optimize/

  28. #28
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
Quote Originally Posted by donkey7
currently processors have many mechanisms to predict simple regularities in data and code flow so some tricks can fool them and decrease performance.
Yes, I agree. That answers why some assembly optimizations of mine don't run as fast as expected
    BIT Archiver homepage: www.osmanturan.com

  29. #29
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,474
    Thanks
    26
    Thanked 121 Times in 95 Posts
So you should read how those mechanisms work; luckily, they are similar across newer processors.

  30. #30
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by donkey7
    currently processors have many mechanisms to predict simple regularities in data and code flow so some tricks can fool them and decrease performance.
I completely agree. I rarely use asm, and my asm knowledge only covers "general" tricks and some MMX/SSE.

    Concerning branch prediction:

The flag bits themselves are decorrelated from the original pattern; the decorrelated stream has a high information density (-> compression), but it looks "random". I doubt that branch prediction can handle this.

    Quote Originally Posted by osmanturan
    Keep in mind, game algorithms must be much faster than compression algorithms. Because, games are running in real-time.
I think this is wrong, since even my slow cmm3 does 400*1024*8 _expensive_ coding operations per second. I've never seen a game running at 3,276,800 fps!
    M1, CMM and other resources - http://sites.google.com/site/toffer86/ or toffer.tk


