OK, on Victory Day, let me introduce the best compressor I have ever made. With this new BALZ I combined all my knowledge to make a fast and efficient compressor. Just check it out!
http://encode.su/balz/index.htm
Thanks Ilia!
Mirror: Download
Quick test...
Test Files: MC SFC
BALZ [e]
A10.jpg > 836,451
AcroRd32.exe > 1,460,086
english.dic > 861,622
FlashMX.pdf > 3,762,896
FP.LOG > 672,319
MSO97.DLL > 1,885,369
ohs.doc > 830,267
rafale.bmp > 1,048,825
vcfiu.hlp > 689,667
world95.txt > 608,341
Total = 12,655,843 bytes
BALZ [ex]
A10.jpg > 836,451
AcroRd32.exe > 1,452,757
english.dic > 797,618
FlashMX.pdf > 3,758,128
FP.LOG > 581,503
MSO97.DLL > 1,879,986
ohs.doc > 824,992
rafale.bmp > 1,019,511
vcfiu.hlp > 668,959
world95.txt > 588,466
Total = 12,408,371 bytes
Test File: ENWIK8
Compression
BALZ [e]
Compressed Size: 28,674,640 bytes
Elapsed Time: 430.035 Seconds
0000 Days 00 Hours 07 Minutes 10.035 Seconds
BALZ [ex]
Compressed Size: 28,234,913 bytes
Elapsed Time: 681.141 Seconds
0000 Days 00 Hours 11 Minutes 21.141 Seconds
Decompression
BALZ [e]
Elapsed Time: 21.425 Seconds
BALZ [ex]
Elapsed Time: 21.183 Seconds
Thanks LovePimple!
Results on ENWIK9:
BALZ v1.06, e: 249,378,397 bytes
BALZ v1.06, ex: 245,288,229 bytes
LTCB results http://cs.fit.edu/~mmahoney/compression/text.html#2453
I've been busy with other work, so I am falling behind on data compression stuff. BTW, I like the new forum.
Thank you!
That's not all, folks! I made a new improvement - I enlarged the ROLZ model from 64 MB to 128 MB! The new BALZ v1.07 has even higher compression, especially on large files and text files, although it's slower. Also, I increased the gap between the "e" and "ex" modes - "e" is now faster. (According to Matt's results, the gap between these two modes was too small - the compression times were too close.)
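For readers who haven't met ROLZ before, here is a rough, hypothetical sketch of the idea (my illustration, not BALZ's actual code): match offsets are kept per context, so the coder transmits a small table index instead of a full offset, and "enlarging the model" essentially means remembering more positions per context.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical ROLZ offset table (an illustration, not BALZ's actual code).
// For every order-1 context (previous byte) we remember the last TAB_SIZE
// positions where that context occurred; a match is then coded as a small
// (table index, length) pair instead of a raw offset.
struct RolzTable {
    static const int TAB_SIZE = 256;        // offsets remembered per context
    uint32_t pos[256][TAB_SIZE];
    int      head[256];

    RolzTable() {
        std::memset(pos, 0, sizeof(pos));
        std::memset(head, 0, sizeof(head));
    }

    // Record that context 'ctx' was seen at buffer position 'p'.
    void update(uint8_t ctx, uint32_t p) {
        head[ctx] = (head[ctx] + 1) % TAB_SIZE;
        pos[ctx][head[ctx]] = p;
    }

    // Find the longest match for buf[p..n) among remembered positions of 'ctx';
    // 'index' receives the table slot that would be transmitted.
    int find(const uint8_t* buf, uint32_t p, uint32_t n, uint8_t ctx, int& index) const {
        int best = 0;
        for (int i = 0; i < TAB_SIZE; ++i) {
            uint32_t cand = pos[ctx][(head[ctx] - i + TAB_SIZE) % TAB_SIZE];
            if (cand == 0 || cand >= p) continue;   // empty or invalid slot
            int len = 0;
            while (p + len < n && buf[cand + len] == buf[p + len]) ++len;
            if (len > best) { best = len; index = i; }
        }
        return best;
    }
};
```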
OK, some testing results:
fp.log: 573,485 bytes
english.dic: 784,590 bytes
world95.txt: 576,362 bytes
rafale.bmp: 1,009,371 bytes
ENWIK8: 27,423,485 bytes
ENWIK9: 237,606,519 bytes
Also, I have an idea to make the "e" mode really *FAST* by disabling any optimizations. What do you think?
For example, BALZ v1.07 with various parsing schemes:
0-Greedy Parsing
1-Advanced Lazy Matching with 1-byte lookahead
2-Advanced Lazy Matching with 2-byte lookahead
3-Advanced Flexible Parsing
world95.txt:
0: 639,369 bytes
1: 605,878 bytes
2: 595,229 bytes
3: 576,362 bytes
fp.log:
0: 702,537 bytes
1: 675,286 bytes
2: 671,477 bytes
3: 573,485 bytes
acrord32.exe:
0: 1,481,176 bytes
1: 1,465,609 bytes
2: 1,460,975 bytes
3: 1,453,415 bytes
reaktor.exe:
0: 2,099,718 bytes
1: 2,070,973 bytes
2: 2,069,256 bytes
3: 2,030,253 bytes
As you can see, parsing matters most with text files. Maybe greedy parsing leaves too much air in the files. So, for the default mode it seems reasonable to keep either Greedy Parsing or Advanced Lazy Matching with 1-byte lookahead. Your opinion?
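To make the difference between schemes 0 and 1 concrete, here is a rough C++ sketch of greedy parsing vs. lazy matching with 1-byte lookahead; find_match(), emit_match() and emit_literal() are hypothetical placeholders, not BALZ's real functions.

```cpp
// Hypothetical parsing loop (a sketch, not BALZ's source). The helpers below
// stand in for the real match finder and the real encoder.
const int MIN_MATCH = 3;                 // assumed minimum match length

struct Match { int len; int ofs; };
Match find_match(int pos);               // longest match at 'pos', len 0 if none
void  emit_match(const Match& m);
void  emit_literal(int pos);

void parse(int n, bool lazy) {
    int pos = 0;
    while (pos < n) {
        Match cur = find_match(pos);
        if (cur.len < MIN_MATCH) { emit_literal(pos); ++pos; continue; }

        if (lazy && pos + 1 < n) {
            // 1-byte lookahead: if the match starting one byte later is longer,
            // emit a literal now and take the better match on the next step.
            Match next = find_match(pos + 1);
            if (next.len > cur.len) { emit_literal(pos); ++pos; continue; }
        }
        emit_match(cur);                 // greedy path: take the current match
        pos += cur.len;
    }
}
```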
Some results with timings:
3200.txt (16,013,962 bytes):
0: 5,274,712 bytes, 7 sec.
1: 5,042,148 bytes, 13 sec.
2: 4,988,928 bytes, 18 sec.
3: 4,889,024 bytes, 27 sec.
As you can see, parsing matters most with text files.
I don't know about the speed differences, but 1 looks good for a fast default mode.
Option 1 still looks good, although the speed hit over 0 is big.
E2160 @ 9x360=3.24, DDR2-800 5-5-5-18 @ 900
http://shelwien.googlepages.com/balz.htm
1.06 looks much better, but I'd still prefer rar.
Also, I was thinking about it while testing... and have now
lost the last reason to use LZ at all.
Btw, that "last reason" was program distribution and the like,
on which I based my test metric (compression time + time of
downloading at 512kbps and decompression x 10).
But now, if I think about it, I wouldn't feel much inconvenience
if rar's decoding speed became 10x slower - guess that's
because rar's algorithm was almost the same when I was
using it on 386s.
Then, anyway, I don't really care that much about program
installation time, and I mostly use archivers... well, for
archiving. And that's where low speed and data dependence
of LZ optimal parsing are completely bad... I mean, why should I
asymmetrically compress my DVD images when I'd never need
to unpack most of them... while there're symmetrical methods
with better and faster compression?
I still prefer LZ-based algorithms. Things like the new BALZ may be considered a PPM approximation, and at some point it is close to PPM* or PPMDet, since it has many context-based tricks - all LZ output encoding is done via quite complex order-1 context models, offset selection is based on context, etc.
Anyway, the compressor selection is up to each user...
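As an illustration of what order-1 context modeling of the LZ output can look like (my own sketch under simple assumptions, not BALZ's source), each bit of a token byte can be predicted by a counter selected by the previous byte plus the bits already coded:

```cpp
#include <cstdint>

// Sketch of an order-1 bitwise model for coding LZ output bytes (my
// illustration, not BALZ's source). p[ctx][node] is a 12-bit probability of
// bit==1, selected by the previous byte 'ctx' and the bits of the current
// byte coded so far ('node', a bit-tree index).
struct Order1Model {
    uint16_t p[256][256];
    Order1Model() { for (auto& row : p) for (auto& x : row) x = 2048; }

    // Encode one byte under context 'ctx'. encode_bit(bit, p1) stands for a
    // binary arithmetic coder taking the bit and the probability of 1 (0..4095).
    void encode_byte(uint8_t ctx, uint8_t byte, void (*encode_bit)(int, int)) {
        int node = 1;
        for (int i = 7; i >= 0; --i) {
            int bit = (byte >> i) & 1;
            uint16_t& pr = p[ctx][node];
            encode_bit(bit, pr);
            // FPAQ0P-style adaptive update toward the coded bit
            if (bit) pr += (4096 - pr) >> 5; else pr -= pr >> 5;
            node = (node << 1) | bit;
        }
    }
};
```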
More results with timings:
ENWIK8:
0: 29,693,191 bytes, 55 sec.
1: 28,272,252 bytes, 86 sec.
2: 27,934,454 bytes, 108 sec.
3: 27,423,485 bytes, 180 sec.
Still, "1" should be the most efficient in terms of compression ratio vs. compression time. As you can see, the compression ratio in this mode is notably better than with "0" (greedy, unoptimized parsing).
Forget about users; even 7-Zip still doesn't get that much attention.
But what use do _you_ have for LZ compressors?
Guess you could try replacing zlib in game data compression and the like...
But then again, it's totally dumb to compress textures and music with LZ.
BTW, my LZ modifications are quite efficient on textures - even QUAD performs better than WinRAR's PPMD:
http://quad.sourceforge.net/
Of course, the new BALZ performs even better...
Of course ppmd isn't any better in that area.
But you could compare, e.g., with BMF.
And what I wanted to say is that 2D LZ algorithms are absolutely impractical.
Just look at motion prediction...
Of course specialized models are supreme...
But if we're talking in "general purpose" terms...
1. The point is that specialized LZ-like models are ineffective, though not impossible.
2. "Purpose" is what I'm getting at. What do you want to have ideally?
ppmd-level compression with memcpy-like decoding speed?
Btw, how do you know which LZ compressor is better?
As for me, I don't actually know.
Well, it's obvious that the one that is faster at both encoding and decoding
and has better compression is superior, but that's rare.
Ideally, compression should be fast enough, and we should optionally be able to choose the compression mode - favoring compression speed or ratio. Memory usage should not be too high - lower is better; less than 200 MB is preferred. Decompression speed should be as fast as possible; in practice, it should be at least as fast as a plain order-0 arithmetic encoder a la FPAQ0.
BTW, I have an idea to remove the LZ layer from BALZ and add an SR (Symbol Ranking) or PPMDet variation, to see what we get. The results will be posted tomorrow - just interesting to see how LZ can fight with SR-like stuff. At least with SR we don't need any parsing, so compression might be faster, but decompression slower?
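For reference, here is a minimal symbol-ranking sketch (just my illustration of the general idea, not the planned BALZ code): each context keeps its few most recent symbols and the coder transmits the rank instead of the byte.

```cpp
#include <cstdint>
#include <cstring>

// Minimal symbol-ranking sketch (an illustration of the general idea, not the
// planned BALZ code). Each order-1 context keeps its 3 most recent symbols;
// the coder transmits the rank (0..2) on a hit, or 3 ("escape") plus the raw
// byte on a miss. On redundant data rank 0 should dominate.
struct SymbolRank {
    uint8_t rank[256][3];
    SymbolRank() { std::memset(rank, 0, sizeof(rank)); }

    // Returns the code to transmit and updates the list move-to-front style.
    int code(uint8_t ctx, uint8_t sym) {
        uint8_t* r = rank[ctx];
        int res;
        if      (sym == r[0]) res = 0;
        else if (sym == r[1]) res = 1;
        else if (sym == r[2]) res = 2;
        else                  res = 3;                 // escape: send literal
        if (res != 0) {                                // move-to-front update
            if (res >= 2) r[2] = r[1];
            r[1] = r[0];
            r[0] = sym;
        }
        return res;
    }
};
```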
Cool. Let's hope that'd become a first step away from LZ.
Decompression would be faster if you'd add support for rank0 runs.
And SR in general is more complex than CM because you're supposed to
extrapolate the symbol ranking (which is ideally done by the same model as CM),
and then you need another model to encode the result.
Also, as I said, CM can benefit from optimal parsing (in fact, it provides much wider choice with full updates, partial updates, update exclusion etc).
Btw, just remembered something.
In your LZ, are you masking out the symbol following the match?
Why not, then? Is it possible to encode a match shorter than the real match with the given offset?
Still, I doubt that there's any sense in encoding the incomplete match and then an unmasked literal, right?
Also there're other kinds of masking... like ofs2!=ofs1+len1, in the sequence <ofs1;len1><ofs2;len2>.
Edit: Also, I wonder how frequent these incomplete matches are with your optimizer...
You know, it's possible to do partial masking too... like significantly decreasing the symbol probability, while still not making it zero.
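To make the masking idea concrete, a small sketch under my own assumptions (not taken from BALZ): if the parser never cuts matches short, the literal following a match cannot equal the byte that would have extended it, so the model can exclude that symbol or, for partial masking, just scale its probability down.

```cpp
#include <cstdint>

// Sketch of literal masking after a match (my assumptions, not BALZ's code).
// It is only valid if the parser never cuts a match short: then the literal
// following a match of length 'len' taken at distance 'ofs' from position
// 'match_start' cannot equal the byte that would have extended the match.
uint8_t excluded_symbol(const uint8_t* buf, uint32_t match_start,
                        uint32_t ofs, uint32_t len) {
    return buf[match_start - ofs + len];   // the impossible next literal
}

// "Partial masking": instead of forcing the excluded symbol's probability to
// zero, scale it down sharply but keep it non-zero (probabilities in 1..4095).
void apply_partial_mask(uint16_t prob[256], uint8_t excluded) {
    prob[excluded] = (prob[excluded] >> 4) | 1;
}
```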
1. Do you prefer your bitwise model to smaller redundancy?
Though of course it's possible... at least you can cut off a bit
from the most probable symbol or something.
Or just properly recalculate the probabilities.
2. Which last words? I mean that there're some obvious rules...
Like that 2 matches for the same data are never better than a single match.
Edit: But actually, I guess you just have to add the suffix of the previous match to the SSE models for these bits.
Also, I think that you should use unary encoding at least for some symbols... which calls for masking too.
It would make the processing faster (both encoding and decoding) and could improve compression, especially with SSE.
Argh, I understand now. Well, squeezing the last bit out of the output is not what LZ coders are about. Adding tricks like exclusions may make the decoder too expensive - for just a tiny compression benefit. Don't forget about asymmetry!
BALZ's bit model is very special - I tried many models, including PAQ1-styled, FPAQ0-styled, and FPAQ0P-styled, but stayed with a special set of FPAQ0P models with a mixer - a la FPAQ0M - since it turned out the best overall...
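For readers unfamiliar with those coders, here is a hedged, simplified sketch of the general technique being described (not the actual BALZ model): FPAQ0P-style shift-updated bit counters combined by a simple adaptive mixer.

```cpp
#include <cstdint>

// Simplified toy version of the technique named above (not the real BALZ or
// FPAQ0M code): FPAQ0P-style shift-updated bit counters, plus a simple
// adaptive linear mixer over two of them.
struct Counter {
    uint16_t p = 2048;                         // probability of bit==1, 0..4095
    void update(int bit) { if (bit) p += (4096 - p) >> 5; else p -= p >> 5; }
};

struct Mixer {
    int w = 2048;                              // weight of the first model
    int mix(int p1, int p2) const { return (p1 * w + p2 * (4096 - w)) >> 12; }
    void update(int p1, int p2, int bit) {
        int t = bit ? 4096 : 0;
        int e1 = t - p1, e2 = t - p2;
        // shift the weight toward the model that predicted the bit better
        if (e1 * e1 < e2 * e2) w += (4096 - w) >> 7;
        else                   w -= w >> 7;
    }
};
```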