Here is an update. I modified max.cfg to replace the paq8 style mixing with a paq9a style mixer chain. This improved the Calgary corpus result from 655K to 650K. I also fixed the order 1 word context (it was sparse before), which gained another 3K. Chaining the order 0 and order 1 word contexts gained another 0.4K, and raising the two highest order contexts by 1 (from orders 6 and 7 to orders 7 and 8, so order 6 is now skipped) gained another 0.4K.
Mixer chains only work when going from low order to high order within the same kind of context. For example, chaining the sparse contexts made compression worse. So did chaining the word contexts onto the end of the character contexts.
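To illustrate the chaining idea, here is a toy Python sketch. It is not the zpaq or paq9a code: the class `Mixer2`, the learning rate, the initial weights and the floating point arithmetic are all assumptions made for the example. Each stage mixes the output of the previous stage with the prediction of the next higher order context, so the chain runs from low order to high order.

```python
import math

def stretch(p):
    """Logit of a probability, clamped away from 0 and 1."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return math.log(p / (1.0 - p))

def squash(x):
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))

class Mixer2:
    """Hypothetical 2-input mixer: previous stage output plus one new prediction."""
    def __init__(self, rate=0.02):
        self.w = [0.5, 0.5]      # assumed initial weights
        self.rate = rate         # assumed learning rate
        self.x = (0.0, 0.0)
        self.p = 0.5

    def predict(self, p_prev, p_new):
        self.x = (stretch(p_prev), stretch(p_new))
        self.p = squash(self.w[0] * self.x[0] + self.w[1] * self.x[1])
        return self.p

    def update(self, bit):
        # Move the weights to reduce the prediction error on the coded bit.
        err = bit - self.p
        self.w = [w + self.rate * err * x for w, x in zip(self.w, self.x)]

def chain_predict(mixers, preds):
    """preds[0] is the lowest order; each mixer adds the next higher order."""
    p = preds[0]
    for mixer, p_new in zip(mixers, preds[1:]):
        p = mixer.predict(p, p_new)
    return p

def chain_update(mixers, bit):
    for mixer in mixers:
        mixer.update(bit)
```

With three predictions (say orders 0, 1 and 2) the chain needs two mixers: call `chain_predict(mixers, preds)` before coding each bit and `chain_update(mixers, bit)` after it.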
Code:
(zpaq 0.04 file tuned for high compression (slow)
on the Calgary corpus. Uses 278 MB memory)
comp 5 9 0 3 22 (hh hm ph pm n)
0 const 160
1 icm 5 (orders 0-5, 7)
2 imix2 13 0 1 64 16
3 imix2 16 0 2 64 16
4 imix2 18 0 3 64 16
5 imix2 19 0 4 64 16
6 imix2 20 0 5 64 16
7 imix2 20 0 6 64 16
8 match 22
9 icm 17 (order 0 word)
10 imix2 19 0 9 64 16 (order 1 word)
11 icm 10 (sparse with gaps 1-3)
12 icm 10
13 icm 10
14 icm 14 (pic)
15 mix 16 0 15 24 255 (mix orders 1 and 0)
16 mix 8 0 16 10 255 (including last mixer)
17 avg 15 16 64 (average of both mixers)
18 sse 8 17 32 255 255 (order 0)
19 avg 17 18 64
20 sse 16 19 32 255 255 (order 1)
21 avg 19 20 96
hcomp
c++ *c=a b=c a=0 (save in rotating buffer)
d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash *d=a b--
d++ hash b-- hash *d=a b--
d++ hash *d=a b-- (match, order 8)
d++ a=*c a&~ 32 (lowercase words)
a< 65 jt 14 a> 90 jt 10
d++ hashd d-- (added: update order 1 word hash)
*d<>a a+=*d a*= 20 *d=a jmp 9
a=*d a== 0 jt 3 (order 1 word)
d++ *d=a d--
*d=0 d++
d++ b=c b-- a=0 hash *d=a (sparse 2)
d++ b-- a=0 hash *d=a (sparse 3)
d++ b-- a=0 hash *d=a (sparse 4)
d++ a=b a-= 212 b=a a=0 hash
*d=a b<>a a-= 216 b<>a a=*b a&= 60 hashd (pic)
d++ a=*c a<<= 9 *d=a (mix)
d++
d++
d++ d++
d++ *d=a (sse)
halt
post
0 (may be 0 for PASS or x for EXE/DLL (E8E9))
(if x, set ph=0, pm=3)
end
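The word context lines in hcomp are hard to read, so here is a rough Python model of what they appear to compute. It is a simplified sketch, not a ZPAQL translation: the function name `word_contexts` is made up, the 32 bit wraparound is an assumption, and the hash step `(h + c + 512) * 773` is assumed for the `hashd` instruction rather than taken from the zpaq 0.04 source.

```python
MASK = 0xFFFFFFFF  # assumed 32 bit context values

def word_contexts(data: bytes):
    """Return a list of (order0_word, order1_word) values, one pair per byte."""
    w0 = 0  # hash of the letters seen so far in the current word
    w1 = 0  # previous word's hash, extended by the current word's letters
    out = []
    for byte in data:
        c = byte & ~32                 # clear bit 5 so 'a'..'z' map to 'A'..'Z'
        if 65 <= c <= 90:              # a letter: extend both word hashes
            w1 = ((w1 + c + 512) * 773) & MASK   # assumed hashd step
            w0 = ((w0 + c) * 20) & MASK
        elif w0 != 0:                  # first non-letter after a word
            w1 = w0                    # the finished word becomes the order 1 base
            w0 = 0
        out.append((w0, w1))
    return out
```

In this model, upper and lower case text give the same contexts, and a run of several non-letters leaves the saved previous word unchanged.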
Result:
Code:
278.473 MB memory required.
calgary\BIB 111261 -> 23689
calgary\BOOK1 768771 -> 199077
calgary\BOOK2 610856 -> 124026
calgary\GEO 102400 -> 46856
calgary\NEWS 377109 -> 90774
calgary\OBJ1 21504 -> 8838
calgary\OBJ2 246814 -> 56364
calgary\PAPER1 53161 -> 11192
calgary\PAPER2 82199 -> 17126
calgary\PIC 513216 -> 28690
calgary\PROGC 39611 -> 9137
calgary\PROGL 71646 -> 11062
calgary\PROGP 49379 -> 7977
calgary\TRANS 93695 -> 11628
-> 646436
1: 271/2048 (13.23%)
2: 54510/524288 (10.40%)
3: 654721/4194304 (15.61%)
4: 2041588/16777216 (12.17%)
5: 4099140/33554432 (12.22%)
6: 6439283/67108864 (9.60%)
7: 11186259/67108864 (16.67%)
8: 2620974/16777216 (15.62%)
9: 717822/8388608 (8.56%)
10: 6904750/33554432 (20.58%)
11: 34982/65536 (53.38%)
12: 38923/65536 (59.39%)
13: 42014/65536 (64.11%)
14: 454900/1048576 (43.38%)
Used 43.34 seconds
The numbered lines after the total show memory used by each component, as used/allocated with the percentage full in parentheses. I'm still developing this code.
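The listing can be checked with a few lines of Python. This is only a sanity check on the numbers posted above; the variable names are made up for the example, and the observation that the icm and imix2 tables are allocated 2^(n+6) units for size argument n is read off this listing, not taken from the zpaq source.

```python
# (input bytes, compressed bytes) copied from the results above
files = {
    "BIB": (111261, 23689), "BOOK1": (768771, 199077),
    "BOOK2": (610856, 124026), "GEO": (102400, 46856),
    "NEWS": (377109, 90774), "OBJ1": (21504, 8838),
    "OBJ2": (246814, 56364), "PAPER1": (53161, 11192),
    "PAPER2": (82199, 17126), "PIC": (513216, 28690),
    "PROGC": (39611, 9137), "PROGL": (71646, 11062),
    "PROGP": (49379, 7977), "TRANS": (93695, 11628),
}
total_in = sum(size for size, _ in files.values())
total_out = sum(packed for _, packed in files.values())
bits_per_char = 8.0 * total_out / total_in

# (kind, size argument, used, allocated) for components 1..14
components = [
    ("icm", 5, 271, 2048), ("imix2", 13, 54510, 524288),
    ("imix2", 16, 654721, 4194304), ("imix2", 18, 2041588, 16777216),
    ("imix2", 19, 4099140, 33554432), ("imix2", 20, 6439283, 67108864),
    ("imix2", 20, 11186259, 67108864), ("match", 22, 2620974, 16777216),
    ("icm", 17, 717822, 8388608), ("imix2", 19, 6904750, 33554432),
    ("icm", 10, 34982, 65536), ("icm", 10, 38923, 65536),
    ("icm", 10, 42014, 65536), ("icm", 14, 454900, 1048576),
]

def percent_full(used, allocated):
    """Percentage of a component's table that is in use."""
    return round(100.0 * used / allocated, 2)

for number, (kind, n, used, allocated) in enumerate(components, start=1):
    print(number, kind, n, percent_full(used, allocated))
print(total_in, "->", total_out, "bpc:", round(bits_per_char, 3))
```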