1. Just recalled an old-school algorithm called PREDICTOR.

Some info can be found here:
http://rfc.net/rfc1978.html

The idea is:

Keep the "prediction table", which maps hash->symbol.

Each time we check the current symbol against the one stored in the table; if the guess is correct, we output just one bit, otherwise a bit plus the unguessed byte. Just like a one-byte LZP.

2. http://www.ietf.org/rfc/rfc1978.txt

It works exactly like my TarsaLZP: I have two models for LZP plus one model for literals (unguessed bytes).

3. that's symbol ranking

4. ppp ranks #88 on LTCB (just below fpaq0f2). http://cs.fit.edu/~mmahoney/compression/text.html#5793

I call it LZP. LZP is symbol ranking with a queue size of 1, or if you prefer, ROLZ with a dictionary size of 1.

5. > I call it LZP. LZP is symbol ranking with a queue size of 1, or if you prefer,
> ROLZ with a dictionary size of 1.

Well, LZP's definition is this: http://cbloom.com/src/index_lz.html
So I guess I'm all wrong here.

But I'd still prefer to reserve the name LZP for an LZ77 algorithm with offsets (filtered by the current suffix) and match lengths, because LZ is supposed to be a fast asymmetric algorithm, while symbol ranking is symmetric, not as fast, and has further generalizations.

6. By the way, for ASM lords:

Code:
```
procedure decompress;
asm
  lea esi,c         // esi -> compressed input
  lea edi,buf       // edi -> output buffer
  lea ebp,tab       // ebp -> prediction table
  mov ecx,n         // number of symbols to decode
  xor ebx,ebx       // ebx = context hash
  mov dx,1          // dx = flag bits; 1 = "fetch the next flag byte"
@@1:
  cmp dx,1
  jne @@2
  lodsb             // load the next flag byte
  mov dl,al
@@2:
  test dx,1         // current flag bit: 1 = guessed
  jnz @@3
  lodsb             // miss: read the literal byte
  mov [ebp+ebx],al  // and remember it in the table
  jmp @@4
@@3:
  mov al,[ebp+ebx]  // hit: take the byte from the table
@@4:
  stosb             // write the decoded byte
  shl ebx,5         // slide the context hash
  xor bl,al
  and ebx,$fffff
  shr dx,1          // advance to the next flag bit
  loop @@1
end;```

It's my ASM-driven PREDICTOR. As you can see, it's pretty small and cool. Looks like I will add it to the RASH project...

7. Originally Posted by encode
By the way, for ASM lords:
They sit on the wasm.ru forum

8. Originally Posted by Shelwien
Well, LZP's definition is this: http://cbloom.com/src/index_lz.html
So I guess I'm all wrong here.
No, you're right. It is symbol ranking, unless you want to call it LZP with a fixed match length of 1.

9. BTW I looked at the LZP programs. They all failed on enwik9 (out of memory) but worked on enwik8. The program tries to allocate buffers for input and output all at once.
http://cs.fit.edu/~mmahoney/compression/text.html#f330

Also, lzp1 and lzp2 crashed on calgary.tar. lzp3o2 decompressed it correctly, then crashed on exit. No problems on enwik8.

I tried compiling, but there were a lot of problems building crblib with gcc, so I gave up. lzp2 didn't include a decompressor.

I also looked at ppmz2, but it apparently only runs in a test mode.

10. By the way, Matt, your hash function from SR2 can easily fit into PPP-like algorithms.

h = (h * (5 << 5)) + c + 1;

5 = 101 in binary notation

thus:

h = ((h << 5) + (h << 7)) + c + 1;

Actually, we may eliminate the "+ 1"; in my tests:

h = (h * (5 << 5)) + c;

also we may do:

h = (h * (5 << 5)) ^ c;

This one performs better on binary files. In addition, for binary files we may use an order-3:

h = (h * (5 << 7)) ^ c;

In this case the table size should be 1<<21, since 7+7+7 = 21.

Any ideas about other sliding hashes?

11. Yes, c + 1 is probably not needed. I used it for better hashing of strings of zero bytes.

g++ does a nice optimization of h * (5 << 5):
Code:
```
leal    (%eax,%eax,4), %eax
sall    $5, %eax```

12. Intel C++ does the same:
Code:
```
lea ebx,[ebx+ebx*4]
shl ebx,5```
So, I will use this ASM code...

13. ...and VC2005 does it the same way...

14. I also noticed that no one uses the LOOP opcode.

In Delphi sources we can often see:

DEC ECX
JNZ @LOOP

In some optimization guidelines we may find something like this:

SUB ECX,1
JNZ @LOOP

But Intel C++ is even smarter:

JNZ @LOOP

15. Why should ADD/SUB be better? ADD/SUB requires an extra immediate operand (in this case).

16. It's not so simple. We're facing some hardware-specific stuff (INC/DEC updates only part of the flags register and so on - hidden stuff).
pentopt.pdf

17. Thanks! The last time I read about asm optimization was in the days of the 386/486. To answer my question:

p. 84
"For historical reasons, the INC and DEC instructions leave the carry flag unchanged, while
the other arithmetic flags are written to. This causes a false dependence on the previous
value of the flags and costs an extra uop. To avoid these problems, it is recommended that
you always use ADD and SUB instead of INC and DEC. For example, INC EAX should be

BTW: After reading about hashes, I implemented the "predictor" algorithm, and I found an implementation on the net from, say, 1989. It used 12-bit hashes and order-2 contexts.

I made an extremely fast decoder using a jump table (asm): there are 16 functions (without stack frames) indexed by a group of 4 flags. Function "0000" decodes four unprocessed characters ... function "1111" decodes four rank-0 symbols.

18. And hence, a hash function like:
Code:
```
h=(h*(5<<5)+c+1)&0xfffff;```

In optimized asm it looks like:
Code:
```
lea ebx,[ebx+ebx*4]
shl ebx,5
lea ebx,[ebx+eax+1]
and ebx,1048575```

The first LEA+SHL pair is equal to:
IMUL EBX,160
The second LEA is equal to:
ADD EBX,EAX followed by ADD EBX,1 (i.e. EBX = EBX + EAX + 1)
19. Originally Posted by toffer
I made an extremely fast decoder using a jump table (asm): there are 16 functions (without stack frames) indexed by a group of 4 flags. Function "0000" decodes four unprocessed characters ... function "1111" decodes four rank-0 symbols.
Currently I'm working on the same thing. I wrote a super-small and fast decoder in ASM, primarily for my EXE packer/crypter.

20. I'm aware of simpler things like using LEA and SHx. But my asm knowledge doesn't cover such tricky stuff as the one above (using ADD instead of DEC), since it involves knowledge of the newer Pentium instruction handling.

Originally Posted by encode
Currently I'm working on the same thing. I wrote a super-small and fast decoder in ASM, primarily for my EXE packer/crypter.
I meant that I did this some years ago (when I first read about hashing). Actually, I used JMPs instead of CALL/RET to avoid return address handling. Do you also try to use jump tables?
If you use this approach you could add (binary) RLE to compress rank-0 runs without a big performance hit, since a decoding function doesn't involve any branches.

At university we once did a fun project and implemented some asm tasks ONLY using arithmetic and logic instructions. The program became huge, but the supervisor was confused and couldn't follow how the source works.

21. Originally Posted by toffer
Do you also try to use jump tables?
Nope. I think inline procedures and macros are better.

And here is the test program packed with RASH (no anti-emulation tricks included):
test.exe

This is a small app that displays and moves a large picture.

The stub was written in Delphi; the decompress procedure is written in pure asm. The only reason I use asm here is that I can apply some anti-debug and anti-emulation tricks. In addition, hand-tuned asm code is far faster than Delphi-generated code.

22. My first thought was similar. But using a jump table completely kicks out any branches - in the end, the jump table version turned out to be the fastest I could get.

23. A long time ago, I noticed that Delphi compiles "case (...) of" statements into jump tables. That's cool, isn't it?

24. This is nothing surprising. You should use a case statement instead of "if"s when your case values x1, x2, ... are contiguous, to allow such a transformation: jump to address[x-xmin]. That's the reason for the existence of case statements.

Originally Posted by toffer
kicks out any brances
Don't get this wrong, I mean branches like: got a match? decode the match, otherwise decode a character.

25. Originally Posted by toffer
got match? decode match, otherwise decode character.
Here I use CMP/TEST + JNE/JNZ stuff, no external functions. All the ASM code fits in a single procedure.

26. @toffer:
Also, I saw jump tables in the Quake 2 source code. There are 8 different options for calculating the distance between an axis-aligned bounding box and a plane. This must be as fast as possible, because thousands of calculations are needed per second! I think jump tables are very effective. Keep in mind, game algorithms must be much faster than compression algorithms, because games run in real time.

27. That's fast if the branches are long, the jump table is small (so it fits easily in cache and doesn't occupy much space, so it isn't pushed out by other data), there are many branches (>3), and there's no branch that's executed very often (with >50% probability).

If there's a branch with a high probability (50% or more), then there should be a normal jump for that, i.e.
Code:
```
jcond .second
.first:
; very often executed code
jmp .out
.second:
jmp jumptable[index]
.out:```

A jump table always generates a misprediction. If you use normal conditional jumps, the processor tries to guess whether there will be a jump or not (and it guesses pretty well).

Originally Posted by toffer
Actually i used jmps instead of call/ret to avoid return address handling.
The processor has a special internal stack for function return addresses, so it can guess the address of the place to return to without reading the general stack. If you fool the processor once, it flushes the entire internal stack and can't predict further returns (before the next calls).

Current processors have many mechanisms to predict simple regularities in data and code flow, so some tricks can fool them and decrease performance.

optimization guidelines are here:
http://agner.org/optimize/

28. Originally Posted by donkey7
Current processors have many mechanisms to predict simple regularities in data and code flow, so some tricks can fool them and decrease performance.
Yes, I agree. This answers why some assembly optimizations of mine don't run as fast as expected.

29. So you should read how those mechanisms work. Luckily, they are similar in newer processors.

30. Originally Posted by donkey7
currently processors have many mechanisms to predict simple regularities in data and code flow so some tricks can fool them and decrease performance.
I completely agree. I rarely use asm, and my asm knowledge only covers "general" tricks and some MMX/SSE.

Concerning branch prediction:

The flag bits themselves are decorrelated compared to the original pattern; the decorrelation has a high information density (-> compression), but the decorrelated data itself looks "random". I doubt that branch prediction can handle this.

Originally Posted by osmanturan
Keep in mind, game algorithms must be much faster than compression algorithms. Because, games are running in real-time.
I think this is wrong, since even my slow cmm3 does 400*1024*8 _expensive_ coding operations per second. I've never seen a game running at 3,276,800 fps!
