With a 16 MB dictionary, 3x the dictionary size is sufficient, as 3 bytes are enough to store values < 16 MB. For practical reasons 4x is OK.
Originally Posted by encode
Not sure what you mean by "single" next[]. There is just one.
I meant that we also need head[].
Originally Posted by Uwe Herklotz
... yes, you got the idea - head[] may be arbitrary-sized, but next[] size is fixed. BT uses the same idea - it uses head[], left[] and right[], where the last two arrays are dict-sized, so minimum memreqs are 8x
for each buffer position p, left[p] points to the previous buffer position with the same hash value and a lexicographically smaller string, and right[p] - to a larger one
so searching for the best match is straightforward:
hash = hash(p) // where p is the current pointer
match = head[hash]
while match do
--if( memcmp(p,match) < 0 )
----then match = left[match]
----else match = right[match]
but updating is a really smart trick. i will try to recall it
also take notice that on each cycle step you reduce the range of strings that match may point to. for example, if the current string is "abcd" and you compared it to "abce", then all subsequent matches that you will check will also be less than "abce". otoh, if on another step you've found that this "abcd" is greater than "abb", then you may be sure that all subsequent matches you check will be larger than "abb" - it's the usual property of a binary search tree. so, after checking these two matches, all matches you will check will be between "abb" and "abce"
and this means that you may reduce the compare cost - if the best smaller match was 2 bytes long and the best larger match was 3 bytes long, you may be sure that all matches you check will have at least the same min(2,3)=2 first chars as your current string, and you may skip checking these bytes. in this example, you may be sure that all subsequent matches start with "ab"
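The search loop above, with the skip-bytes trick, can be sketched like this (a minimal illustration with my own names, not the actual tor or lzma code; each comparison starts at min(lenL, lenR) bytes, which are already proven equal):

```c
#include <stdint.h>

#define NIL ((uint32_t)-1)

/* compare strings at positions a and b; the first 'start' bytes are
   already known to be equal, so comparison begins there. the length of
   the common prefix is returned via *common */
static int str_cmp_from(const uint8_t *buf, uint32_t a, uint32_t b,
                        uint32_t start, uint32_t maxlen, uint32_t *common)
{
    uint32_t i = start;
    while (i < maxlen && buf[a + i] == buf[b + i]) i++;
    *common = i;
    if (i == maxlen) return 0;
    return (int)buf[a + i] - (int)buf[b + i];
}

/* descend from the root, tracking the best match seen so far.
   lenL/lenR are the proven common prefixes with the best smaller and
   larger strings, so every further candidate is known to share at
   least min(lenL, lenR) bytes with the current string */
uint32_t bt_find(const uint8_t *buf, uint32_t p, uint32_t root,
                 const uint32_t *left, const uint32_t *right,
                 uint32_t maxlen, uint32_t *bestlen)
{
    uint32_t node = root, best = NIL, lenL = 0, lenR = 0;
    *bestlen = 0;
    while (node != NIL) {
        uint32_t start = lenL < lenR ? lenL : lenR, common;
        int c = str_cmp_from(buf, p, node, start, maxlen, &common);
        if (common > *bestlen) { *bestlen = common; best = node; }
        if (c == 0) break;                       /* full-length match */
        if (c < 0) { lenR = common; node = left[node]; }
        else       { lenL = common; node = right[node]; }
    }
    return best;
}
```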
i use the same scheme in tor and think that it is much better than hash chains. as i said, hash chains were optimal in the 32-64kb dictsize ages
Originally Posted by encode
my tests show that having 128k hash entries is optimal, at least for 4-byte hashing and my test files. one can easily figure out that for NUMNODES=64 (i call it "hash row length") we need only 32-64mb of memory regardless of dictsize
so, using a linear hash is advantageous with large dicts, and your lzpm proves it by outperforming lzma:fast with a 16mb dict, which needs about 160mb for operation
but afair, my searching implementation is much faster. i specially published the sources because you may be interested in using it together with your compression engine
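A quick sanity check on the 32-64mb figure (my arithmetic, assuming 4- or 8-byte entries per stored position):

```c
#include <stdint.h>

/* 128k hash rows of NUMNODES positions each - the total depends only on
   these constants, not on the dictionary size */
enum { ROWS = 128 << 10, NUMNODES = 64 };

uint64_t rowhash_bytes(unsigned entry_size) {
    return (uint64_t)ROWS * NUMNODES * entry_size;
}
```

128k rows x 64 nodes x 4 bytes gives exactly 32 MB, and 8-byte entries give 64 MB.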
I already had a brief look at it.
Originally Posted by Bulat Ziganshin
By the way, reading papers about LZCB I've found a nice idea for my LZPM!
It's some kind of context coding of match positions.
And this surprisingly gives some improvement not only on text files!
Check out some figures on SFC for LZPM 0.02:
Layout:
<number of context bits> - <compressed size>
0 - 13,355,017 bytes
1 - 13,352,362 bytes
2 - 13,330,503 bytes
3 - 13,321,907 bytes
4 - 13,311,877 bytes
5 - 13,302,937 bytes
6 - 13,292,131 bytes
7 - 13,287,209 bytes
8 - 13,287,724 bytes
All in all, the decompression speed stays the same. For binary files the best number of context bits is 5. Currently, I'm thinking about whether or not to add such a thing.
Also, in parallel I dug into the LZMA sources, and now I understand the meaning of almost all switches.
However, I think I should keep the arith coder as simple as possible, avoiding redundant context tricks.
For example, with QUAD I tried many things for compression improvement which worked, but in the end they weren't included. For example, we can code the match index with exclusions. Say we have 11 available match indexes out of 64, so we exclude all indexes above 11 when coding.
well, i don't know what "context bits" is, and if you have a good idea please share its description with us (at least with me; i think that two lz compressors are enough for a forum with ~10 active readers)
btw, lzma uses lc, pb and lp parameters - isn't it one of them? anyway, i don't know their exact meaning
i specially moved each piece into a separate class, making it easier to steal them
Originally Posted by encode
btw, Igor does the same. actually, it seems that you can't reuse my sources as is because they are still missing separate hashes for 2-3 byte strings. otoh, it seems that you also don't have multiple hashes in your program. it's another idea, pioneered in the late 90s - instead of hashing by 3 bytes and searching all the strings via one hash, we make small separate hashes specially for searching 2-byte, 3-byte, 4-byte strings, and one large hash which holds only 5-byte matches. details may differ; afair ER said that a 5-3-2 scheme turned out to be the best for him, while IP uses a 4-3-2 scheme
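A sketch of this multi-hash idea (table sizes and the hash function are my guesses, not from tor or lzpm): dedicated head tables for 2- and 3-byte strings plus one big table for longer matches, each updated at every position.

```c
#include <stdint.h>

#define HEAD2_BITS 16   /* 2-byte strings: direct 64k table */
#define HEAD3_BITS 16   /* 3-byte strings: small hash */
#define HEAD5_BITS 22   /* 5-byte (and longer) strings: the big hash */

static uint32_t head2[1 << HEAD2_BITS];
static uint32_t head3[1 << HEAD3_BITS];
static uint32_t head5[1 << HEAD5_BITS];

/* hash the first n bytes at p down to 'bits' bits */
static uint32_t hash_n(const uint8_t *p, int n, int bits) {
    uint32_t h = 0;
    for (int i = 0; i < n; i++) h = h * 2654435761u + p[i];
    return h >> (32 - bits);
}

/* at each position, record it in all three tables; a short match is
   then looked up in its dedicated table, longer candidates come from
   the big one */
void update(const uint8_t *buf, uint32_t pos) {
    head2[buf[pos] | (buf[pos + 1] << 8)] = pos;
    head3[hash_n(buf + pos, 3, HEAD3_BITS)] = pos;
    head5[hash_n(buf + pos, 5, HEAD5_BITS)] = pos;
}
```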
For myself I call my LZPM a reverse-engineered ROLZ2.
Firstly, about LZMA parameters:
-lc:
set number of literal context bits - [0, 8], default: 3
This many high bits of the previous byte will be used as a context for a literal.
8 bits = pure order-1 coder
with 4 bits we will use the higher nibble of the previous byte as a context
ROLZ2 also has such a parameter, which is hidden from the masses! Of course, LZPM has this parameter too.
-lp:
set number of literal pos bits - [0, 4], default: 0
This one will extend the context for literals. This may improve compression of MM data and tables.
Here we will include the lower bits of the current position in the literal context.
Example:
i - current position
x - current context for literal
context = (x << 2) | (i & 3);
if -lp = 2.
Theoretically this will work. Anyway, by default LZMA will not use such a thing. LZPM doesn't use this; AFAIK, ROLZ2 doesn't either.
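How lc and lp could combine into one literal context can be sketched like this (illustrative only, following the description above; not LZMA's actual code):

```c
#include <stdint.h>

/* lc high bits of the previous byte, extended by lp low bits of the
   current position */
uint32_t literal_context(uint8_t prev_byte, uint32_t pos, int lc, int lp) {
    uint32_t high = (uint32_t)prev_byte >> (8 - lc); /* lc high bits of prev byte */
    uint32_t posbits = pos & ((1u << lp) - 1);       /* lp low bits of position  */
    return (high << lp) | posbits;
}
```

With lc=8 and lp=0 this degenerates to the pure order-1 context, i.e. the whole previous byte.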
-pb:
set number of pos bits - [0, 4], default: 2
With this one LZMA will use the same pos context for some structures like the flags (IsMatch) and the RepGX coders.
Again, AFAIK, ROLZ2 and LZPM don't use that.
I tried this, and it will work with QUAD-like schemes, but with LZPM things are more complicated and we're unable to use such a thing.
Originally Posted by Bulat Ziganshin
Well, this scheme mainly targets ROLZ-like coders; however, it works with modified LZ77 coders as well.
Originally Posted by Bulat Ziganshin
We encode the match position (or index) via an order-0 arithmetic encoder. Additionally, we can include a few bits of the previous byte as a context (the higher bits or the whole byte).
Thus we will encode match positions via an order-1 arithmetic encoder. Read the first paper about LZP/LZCB for more info and ideas.
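A minimal sketch of such context selection for match indexes (the names and the simple counting model are mine, for illustration; a real coder would feed these counts into the arithmetic coder's probability model):

```c
#include <stdint.h>

#define CTX_BITS 3
#define NUM_INDEXES 64

/* one adaptive frequency table per context */
static uint16_t freq[1 << CTX_BITS][NUM_INDEXES];

/* the context is just a few high bits of the previous byte */
static uint32_t ctx_of(uint8_t prev_byte) {
    return (uint32_t)prev_byte >> (8 - CTX_BITS);
}

/* after coding match index 'idx', update the model of that context */
void model_update(uint8_t prev_byte, int idx) {
    freq[ctx_of(prev_byte)][idx]++;
}
```

With CTX_BITS=0 this collapses back to the order-0 coder, which matches the first row of the table above.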