Okay, here is the new compressor! It's still in a very early stage of development, but it needs to be tested!
Enjoy!
Link:
lzpm001.zip (26 KB)
Hi! Good compressor! Very fast! Good work!
On my test set (13 780 972 bytes): compression speed 1 104 kB/s, decompression 7 789 kB/s. Compared to QUAD on my files, it compresses nearly as fast and decompresses 3x faster!
Thanks Ilia!
Quick test...
MC SFC test:
LZPM v0.01
A10.jpg > 840,877
AcroRd32.exe > 1,649,986
english.dic > 1,020,553
FlashMX.pdf > 3,777,554
FP.LOG > 804,037
MSO97.DLL > 2,022,912
ohs.doc > 843,615
rafale.bmp > 1,093,602
vcfiu.hlp > 723,565
world95.txt > 614,800
Total = 13,391,501 bytes
QUAD v1.12
A10.jpg > 853,522
AcroRd32.exe > 1,503,119
english.dic > 915,432
FlashMX.pdf > 3,831,098
FP.LOG > 717,207
MSO97.DLL > 1,913,758
ohs.doc > 848,771
rafale.bmp > 1,036,312
vcfiu.hlp > 708,048
world95.txt > 625,831
Total = 12,953,098 bytes
ENWIK8:
LZPM v0.01
Compressed Size = 29,154,666 bytes
Compression Time = 00:03:34.484
QUAD v1.12
Compressed Size = 29,649,604 bytes
Compression Time = 00:01:06.625
lzpm is expected to be slower than QUAD at compression. The decompression speed, however, can be made even faster - I have a few ideas, and lzpm will probably end up with faster decompression than LZMA (and of course faster than ROLZ2).
Now I'm working on two things:
+ Arithmetic encoder
+ LZ decoder
So the main LZ encoder remains a rough draft - currently I'm concentrating on getting the fastest possible decompression.
Yep, the new LZPM is faster than, or very close to, LZMA!!!
I just implemented the same trick that LZMA uses - instead of keeping 0/1 counters, we keep just the probability of the next bit. As a result I get super-small and super-fast arithmetic coding!
To be continued...
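For readers unfamiliar with the trick, here is a minimal sketch of such an adaptive bit model, using LZMA-style constants (11-bit probability, shift-by-5 update). It is only an illustration of the idea, not LZPM's actual code.

#include <cstdint>

// Minimal sketch of the "probability instead of 0/1 counters" idea.
// Each context keeps a single probability of bit == 0 and nudges it
// toward the observed bit - no counters, no division.
struct BitModel {
    uint16_t p = 1024;                        // P(bit == 0), scaled to 0..2048

    void update(unsigned bit) {
        if (bit == 0) p += (2048 - p) >> 5;   // move toward "0 is likely"
        else          p -= p >> 5;            // move toward "1 is likely"
    }
};

// Inside a binary range coder the probability simply splits the current range:
//   bound = (range >> 11) * p;
//   bit 0: range = bound;   bit 1: low += bound, range -= bound;
// followed by the usual renormalization - only a few operations per bit.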
The new model gives some interesting results. Overall, compression becomes higher, but it looks like in some cases we get slightly worse text compression.
Okay, check out some results:
A10.jpg:
832,328 bytes [!]
acrord32.exe:
1,634,409 bytes (again, keep in mind, no filters used)
world95.txt:
619,408 bytes
Actually, the upcoming lzpm is just crazy!!!
Good work, Ilia!
Originally Posted by encode
Crazy as in a possible super-high compression ratio?
Originally Posted by LovePimple
How fast the decompression is!
Plus, higher binary compression!
Some timings:
Decompression of pak0.pak (183,997,730 bytes)
lzma 4.43: 20 sec
lzpm 0.02: 22 sec
mcomp v1.00 (rolz2): 27 sec
lzpm 0.01: 28 sec
quad 1.12: 37 sec
As you can see, the new lzpm 0.02 is faster than ROLZ2 and QUAD, and pretty close to LZMA!
Originally Posted by encode
Very powerful!
Increasing the compression ratio will also increase the decompression speed, won't it?
Originally Posted by Black_Fox
Yep, a better LZ layer can give some decompression speed boost (more matches = faster decompression). However, LZPM 0.02, at least currently, has exactly the same LZ part as v0.01. Anyway, 0.02 gets higher compression overall and at the same time faster compression/decompression speed. That's due to a heavily improved model and arithmetic encoder!
A note about Flexible Parsing with this engine - I just tried it. With hashing, such an algorithm is inefficient: compression becomes significantly slower and the compression gain is tiny. I also tried Flexible Parsing with a brute-force search - that gives a nice compression gain, especially on text files. However, like I said, with the brute-force approach compression becomes 100X [!] or more slower.
It's impractical!
Probably in the next versions I'll enhance the current Lazy Matching, say by trying a two-byte lookahead instead of just one (see the sketch after this post).
Continue digging...
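For illustration, here is a rough sketch of lazy matching with a one-byte lookahead (a two-byte version would also probe pos + 2). The Match struct and the helper functions are hypothetical placeholders, not LZPM's API.

#include <cstdint>
#include <cstddef>

struct Match { size_t len; size_t dist; };
const size_t MIN_MATCH = 3;

// Hypothetical helpers assumed to be provided by the encoder:
Match find_longest_match(const uint8_t* buf, size_t pos, size_t size);
void  emit_match(const Match& m);
void  emit_literal(uint8_t byte);

void lazy_parse(const uint8_t* buf, size_t size) {
    size_t pos = 0;
    while (pos < size) {
        Match cur = find_longest_match(buf, pos, size);
        if (cur.len >= MIN_MATCH && pos + 1 < size) {
            // Lazy step: peek one byte ahead; if deferring yields a longer
            // match, emit a literal now and take the better match next time.
            // A two-byte lookahead would additionally probe pos + 2.
            Match nxt = find_longest_match(buf, pos + 1, size);
            if (nxt.len > cur.len) {
                emit_literal(buf[pos]);
                pos += 1;
                continue;
            }
        }
        if (cur.len >= MIN_MATCH) { emit_match(cur); pos += cur.len; }
        else                      { emit_literal(buf[pos]); pos += 1; }
    }
}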
Originally Posted by encode
Could be something like the --ultra-brute mode from UPX.
Originally Posted by Black_Fox
Well, at least it is possible - an LZ engine with PAQ8 speed. However, the decompression speed stays untouched! If only PAQ8 had MUCH faster decompression...
Anyway, currently I think it's not worth it. Some people say that QUAD's "-x" mode is also not really worth the extra compression time.
cabarc and lzma use binary trees when doing optimal parsing.
BTW, I'm working on a fast LZ search engine. This may make optimal/flexible parsing practical without using binary trees. Actually, the idea is the same as in QUAD - use a small array of matches for each hash index and cache a few symbols from the match in the hash, in order to make searching faster (a rough sketch follows below).
BTW, where can I read about flexible parsing? I didn't understand it from your sources.
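To illustrate the idea described above, here is a rough sketch of a bucketed hash table that keeps a few candidate positions per hash index plus a few cached bytes of each candidate, so most mismatches are rejected without touching the window. All names and constants are illustrative - this is not actual QUAD or LZPM code.

#include <cstdint>
#include <cstring>

const int HASH_BITS = 17;
const int SLOTS     = 4;               // candidate positions kept per hash index

struct Entry {
    uint32_t pos;                      // position in the window
    uint32_t cached;                   // first 4 bytes at that position
};

static Entry table[1 << HASH_BITS][SLOTS];

static uint32_t hash4(const uint8_t* p) {
    uint32_t x;
    memcpy(&x, p, 4);
    return (x * 2654435761u) >> (32 - HASH_BITS);
}

// Search: the cached bytes act as a cheap filter - only candidates whose
// first 4 bytes match are verified and extended against the window by the caller.
static int find_candidate(const uint8_t* p) {
    uint32_t h = hash4(p), cur;
    memcpy(&cur, p, 4);
    for (int i = 0; i < SLOTS; i++)
        if (table[h][i].cached == cur)
            return (int)table[h][i].pos;
    return -1;                         // no promising candidate
}

// Insert: shift the small slot array so the oldest candidate drops out.
static void insert(const uint8_t* p, uint32_t pos) {
    uint32_t h = hash4(p);
    memmove(&table[h][1], &table[h][0], (SLOTS - 1) * sizeof(Entry));
    table[h][0].pos = pos;
    memcpy(&table[h][0].cached, p, 4);
}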
Originally Posted by encode
Awesome!
Originally Posted by Bulat Ziganshin
Some info on FP:
http://arturocampos.com/ac_flexible_parsing.html
Originally Posted by Bulat Ziganshin
Actually, I have only a very limited understanding of binary trees. Probably I should look carefully at the LZMA source. Can you explain some basics, or provide pointers about these trees in a human-readable form?
Binary trees are very easy to understand - have a look at http://en.wikipedia.org/wiki/Binary_tree and the links on that page.
B-trees are also a good variant of BTs which you should know about.
Have a nice day =P)
Well, do you understand the idea of hash chains, and in particular why their memory requirement is 4*dictionary size?
thometal:
For understanding the BT used for searching LZ matches, knowing the usual binary tree is a prerequisite, but it's not the whole picture.
Originally Posted by Bulat Ziganshin
I do. The idea of hashing and hash chains is quite simple. From LZSS.C by Haruhiko Okumura I got just a slight understanding of what's going on. From the LZMA docs I've found that Binary Trees and Patricia Trees also use hashing. Do I already use a Binary Tree with LZPM, since for each hash value I keep many "elements"?
Well, why are the memory requirements exactly 4*dictsize? I ask because it doesn't seem that you understand this important point.
prev[], head[] & co.
Bulat, please explain!
But actually, I believe that we can change the HASHSIZE - and control the memory usage...
So, let's start with hash chains.
In the 80's, LZ searching was slow and required large amounts of memory. Either classical binary trees (proposed by Bell in '84) or Patricia trees (as in ar002) were used. Their memory requirements were so high that an 8-16 KB dictionary was the maximum and programs were slow, so LZW compressors were quite popular.
Hash chains, invented by R. Jung, dramatically decreased both memory requirements and compression time. With HC, there are two arrays - head[] and next[]. head[] may have arbitrary size; it's determined only by the number of bits we keep in the hash value. It points to the last string with a given hash value. next[] has exactly one word for each byte in the dictionary and allows jumping to the previous string having the same hash as its index. I.e.:
h     = hash(p)              // p is the pointer to the current position
match = head[h]              // newest position with this hash value
while (match != NIL) {
    memcmp(match, p)         // compare the strings / measure the match length
    match = next[match]      // jump to the previous position with the same hash
}
By using this technique with the ar002 compression engine, ARJ outperformed all other compressors in the early 90's. Then it was adopted by PKZIP and the open-source zip, and the zip sources became the starting point for many other projects. Using this technique allowed raising the dictionary size up to 64 KB in the mid-90's.
RAR 2.0 was among the first compressors to use an even larger, 1 MB dictionary. It continued to use HC, but that turns out to be too slow for such a large dictionary - RAR2's speed was about 50-100 KB/s on today's CPUs. The reason is that with a larger dictionary we have many more strings in each hash chain, so using HC with a 1 MB dictionary is probably 16x slower than with a 64 KB one.
Of course, things may be even worse if someone ever tried to use it for optimal parsing. Nevertheless, cabarc, published approximately in '97, worked faster than RAR2 while having a larger dictionary and optimal parsing with *exhaustive* match searching...
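For reference, here is a minimal head[]/next[] sketch of the scheme described above (illustrative only, not the arj/zip source). Since next[] stores one 32-bit link per byte of the window, it alone costs 4 * dictionary size bytes - which is where the "4*dictsize" figure comes from.

#include <cstdint>

const uint32_t DICT_SIZE = 1 << 20;        // 1 MB window
const uint32_t HASH_SIZE = 1 << 16;
const uint32_t NIL       = 0xFFFFFFFFu;

static uint32_t head[HASH_SIZE];           // newest position per hash (initialize to NIL)
static uint32_t next_[DICT_SIZE];          // link to previous position with the same hash

static uint32_t hash3(const uint8_t* p) {
    return ((p[0] | (p[1] << 8) | (p[2] << 16)) * 2654435761u) >> 16;
}

// Insert the position into its chain; each chain is ordered newest-first.
static void hc_insert(const uint8_t* window, uint32_t pos) {
    uint32_t h = hash3(window + pos);
    next_[pos] = head[h];
    head[h]    = pos;
}

// Walk the chain and return the longest match (naive byte-by-byte compare).
static uint32_t hc_longest_match(const uint8_t* window, uint32_t pos,
                                 uint32_t avail, uint32_t* best_pos) {
    uint32_t best = 0;
    for (uint32_t m = head[hash3(window + pos)]; m != NIL; m = next_[m]) {
        uint32_t len = 0;
        while (len < avail && window[m + len] == window[pos + len]) len++;
        if (len > best) { best = len; *best_pos = m; }
    }
    return best;
}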
Thanks for the info!
So, with a 16 MB dictionary, next[] alone needs 4x the buffer size!
The scheme I use:
Hash lookup:
hashtable[HASHSIZE][NUMNODES];
i.e. we hash only the first N bytes (say MINMATCH), and we have NUMNODES (say 256) different positions to check.
At each step we update the hashtable as a list - the oldest hashes are removed from the table.
With this approach we can easily control the memory usage...
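A literal sketch of that layout (the constants are just examples): memory use is HASHSIZE * NUMNODES * 4 bytes regardless of the dictionary size, so it can be tuned simply by changing HASHSIZE.

#include <cstdint>
#include <cstring>

const int HASHSIZE = 1 << 16;
const int NUMNODES = 256;                       // positions kept per hash index

// 65536 * 256 * 4 bytes = 64 MB with these example settings.
static uint32_t hashtable[HASHSIZE][NUMNODES];

// Update as a list: shift the slots so the oldest position falls off the end
// and the new one goes to the front.
static void update_slot(uint32_t h, uint32_t pos) {
    memmove(&hashtable[h][1], &hashtable[h][0], (NUMNODES - 1) * sizeof(uint32_t));
    hashtable[h][0] = pos;
}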
Originally Posted by encode
Usual binary search trees do not use hashing, but hashing may be used to address a certain binary tree. Using multiple binary trees is reasonable in order to keep the trees small and to reduce the effort for tree restructuring.
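As a rough sketch of "hashing used to address a certain binary tree": the hash of the first bytes selects a tree root, and each small tree orders positions by the string starting there. This is only an illustration of the idea, not LZMA's actual match finder.

#include <cstdint>
#include <cstring>

const int HASH_BITS = 16;

struct Node { uint32_t pos; Node* left; Node* right; };
static Node* roots[1 << HASH_BITS];            // one small tree per hash value

static uint32_t bt_hash3(const uint8_t* p) {
    return (((p[0] << 16) | (p[1] << 8) | p[2]) * 2654435761u) >> (32 - HASH_BITS);
}

// Insert the string starting at 'pos' into the tree selected by its hash.
// A real match finder would also report the best match seen while descending
// and would prune/restructure the tree as the window slides.
static void bt_insert(const uint8_t* window, uint32_t pos, uint32_t max_cmp) {
    Node** link = &roots[bt_hash3(window + pos)];
    while (*link) {
        int c = memcmp(window + pos, window + (*link)->pos, max_cmp);
        link = (c < 0) ? &(*link)->left : &(*link)->right;
    }
    *link = new Node{pos, nullptr, nullptr};
}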
Originally Posted by encode
According to your previous descriptions: no, you don't.