Blizzard - Fast BWT file compressor by Christian Martelock...
Christian's web site: http://christian.martelock.googlepages.com/
It's really fast and efficient. But it has poor compression on english.dic: Blizzard compresses english.dic to 1,170,607 bytes, while most compressors get it down to around 500-750 KB. As you know, english.dic is a special file in which no word ever occurs twice. This made me think that Christian has introduced a new trick for common files!
just a quick test with enwik9, comparing CCMx (6) and RZM with and without BWTlib as a prefilter.
Code:
enwik9          1.000.000.000 bytes
enwik9.blz        215.520.550 bytes
enwik9.ccm        175.238.339 bytes
enwik9.rzm        210.126.103 bytes
enwik9.bwt.ccm    186.057.176 bytes
enwik9.bwt.rzm    201.141.372 bytes
Seems BWTlib helps RZM but hurts CCMx when compressing enwik9.
it would be nice to see if a Blizzard optimized for compression ratio would be as good as RZM/CCMx in compression
just a quick update
i tried using a bigger blocksize of 256000000 bytes
enwik9.blz2 175.414.896 bytes
That's damn close to CCMx (6)
this seems interesting, and I will be back with some decompression time benchmarks
Tested. It may not be the best within the BMP sector, but on PSD files it kicks ass massively regarding speed/ratio. Overall it's pretty nice; Christian isn't called a genius for nothing.
For the last column, the byte order in the files was reversed before compression.
Code:
             [BCM]     [o1rc9f]  [o1rc9g]  [Bliz f]  [Bliz c]  [Bliz' c]
calgary.tar  791,435   785456    784037    800280    790491    787745
book1        212,674   211047    210908    215982    212130    212076
world95.txt  474,291   470390    469640    481578    474891    473631
Hi there!
Thank you for the good feedback and the tests!
I did not announce Blizzard here because it's a boring, common implementation. Well, the context mixer directly after the BWT transform differs from most other BWT implementations, but we already have BBB and BCM doing the same.
Try using blocksize 10000 - this should be a bit better. But after all, Blizzard is just a BWT. So, such performance is expected (OT: CCM does not use an LZ layer).
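For readers who haven't implemented one, here is a minimal sketch of the BWT front end of such a block compressor. This is only an illustration under my own assumptions, not Blizzard's code: it builds the transform naively by sorting rotations, whereas a real compressor would use a proper suffix-array construction and then feed the transformed block to its context mixer / entropy coder, as described above.
Code:
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>

// Naive forward BWT of one block: sort all rotations, output the last
// column plus the index of the original rotation. O(n^2 log n) in the
// worst case - fine for a demo, far too slow for 100 MB blocks.
static std::vector<uint8_t> bwt_forward(const std::vector<uint8_t>& block,
                                        size_t& primary_index) {
    const size_t n = block.size();
    std::vector<size_t> rot(n);
    std::iota(rot.begin(), rot.end(), 0);            // rotation start offsets

    std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
        for (size_t i = 0; i < n; ++i) {             // compare rotations byte by byte
            const uint8_t ca = block[(a + i) % n];
            const uint8_t cb = block[(b + i) % n];
            if (ca != cb) return ca < cb;
        }
        return false;                                // identical rotations
    });

    std::vector<uint8_t> last(n);
    for (size_t i = 0; i < n; ++i) {
        if (rot[i] == 0) primary_index = i;          // needed by the inverse transform
        last[i] = block[(rot[i] + n - 1) % n];       // last column of the sorted rotations
    }
    return last;   // this transformed block is what the modelling stage would then code
}

int main() {
    const std::string s = "abracadabra";
    std::vector<uint8_t> block(s.begin(), s.end());
    size_t primary = 0;
    const std::vector<uint8_t> out = bwt_forward(block, primary);
    std::printf("BWT: %.*s  primary=%zu\n",
                static_cast<int>(out.size()),
                reinterpret_cast<const char*>(out.data()), primary);
}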
Btw., there is a tiny update on my site. I just removed two lines of code from the exe-filter.
Well, there isn't a particular place where CCM gains speed. E.g. on precompressed data, where no LZ layer can help, CCM is still faster than BIT, CMM or LPAQ. On the other hand, CMM and LPAQ are stronger than CCM - LPAQ especially on text data. I have often said that CCM is different from PAQ. Well, I think this is the reason for its speed. PAQ is really great, but you should not use it as a template for writing a fast context mixer.
(Well... this is getting off-topic; forgive me for that.)
Could you explain your statistics storage technique? Is it fully hashed, or is there a hash table alongside a small context tree for frequently occurring contexts, or something else? I would be very happy if you could point out some more details. Surely most other people would be happy, too.
@christian
your new BWT compressor seems to be a very fast one for a BWT-class compressor!
congratulations!
the command line "bliz c d:db.dmp c:db.bliz 100000000"
requires up to 589.496 KB of RAM
---
Block size is 97656 KiB (587463 KiB allocated).
Compressing (normal profile)...
All done: 633136 KiB -> 31742 KiB (5.01%)
---
on a Win2003 system with 4 GB RAM and two Xeon processors:
compression:
sourcefile db.dmp 648.331.264 bytes
balz113 e 33.390.086 bytes 9 min
7zip 35.151.362 bytes 19 min
bcm 001 31.927.387 bytes 22 min
bcm 002 31.022.446 bytes 17 min
bliz 0.24 32.504.413 bytes 8 min
---
this means
1. bliz is two times faster than other BWT compressors
2. bliz can beat 7zip in compression ratio in less than half the time!
very good work!
what about implementing compression of a whole directory, including subdirectories?
It is hashed and uses simple collision resolution. Of course, the order-1 submodel is not hashed.
In the flagship thread, Przemyslaw quoted a post I wrote on the old forum. It's still valid. I think it is a good starting point.
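As a rough illustration of what such a layout can look like (this is only a sketch under my own assumptions, not Christian's code; the slot size, tag width and counter format are made up): a hashed table for the higher orders with a tag per slot for simple collision resolution, next to a direct-indexed order-1 table that needs no hashing.
Code:
#include <cstdint>
#include <vector>

// One hashed slot: a tag (checksum of the full context) for simple
// collision resolution plus an adaptive probability counter.
struct Slot {
    uint16_t tag = 0;
    uint16_t p   = 2048;   // P(bit == 1) scaled to [0, 4096)
};

class HashedModel {
public:
    explicit HashedModel(unsigned bits)
        : mask_((size_t(1) << bits) - 1), table_(size_t(1) << bits) {}

    // On a tag mismatch (collision) the old statistics are simply
    // overwritten - the cheapest possible collision resolution.
    Slot& lookup(uint64_t ctx_hash) {
        Slot& s = table_[ctx_hash & mask_];
        const uint16_t tag = uint16_t(ctx_hash >> 48);
        if (s.tag != tag) { s = Slot{}; s.tag = tag; }
        return s;
    }

private:
    size_t mask_;
    std::vector<Slot> table_;
};

// The order-1 submodel is tiny, so it is direct-indexed (not hashed):
// one probability per (previous byte, bit-tree node) pair.
struct Order1Model {
    std::vector<uint16_t> p = std::vector<uint16_t>(256 * 256, 2048);
    uint16_t& at(uint8_t prev_byte, uint8_t node) { return p[prev_byte * 256u + node]; }
};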
@joerg:
Thank you! Can you please test bliz with blocksizes 16800000 and 33600000? That should be interesting, too. The smaller blocksize should be better on this file.
Regarding the archiver functionality, I'm sorry. But it just takes too much time.
Christian, for quite a long time I was inactive in compression, and recently I found your CCM. Congratulations, you've managed to beat my favourite compression artists: Dmitry (PPMd, PPMonstr) and Matt+Alexander (lpaq).
http://www.maximumcompression.com/data/summary_mf.php :
LPAQ8 78704186 bytes in 1312 s
CCM 1.30c 78598980 bytes in 277 s
PPMonstr J 78086417 bytes in 1628 s
The results (speed) are impressive. I believe that you should release CCM as an open-source CM library (it could be commercial, but free for non-commercial use) to take the place of the widely used PPMd.
Please join us at http://ctxmodel.net/rem.pl?-21
-------------------------------------------------------------------------------
Blizzard 0.24 (Jun 29 2008) - Copyright (c) 2008 by Christian Martelock
-------------------------------------------------------------------------------
bliz c d:db.dmp c:bl1.bliz 100000000
Block size is 97656 KiB (587463 KiB allocated).
All done: 633136 KiB -> 31742 KiB (5.01%)
---
bliz c d:db.dmp c:bl2.bliz 16800000
Block size is 16406 KiB (98694 KiB allocated).
All done: 633136 KiB -> 30765 KiB (4.86%)
---
bliz c d:db.dmp c:bl3.bliz 33600000
Block size is 32812 KiB (197387 KiB allocated).
All done: 633136 KiB -> 31103 KiB (4.91%)
---
compression result:
sourcefile db.dmp 648.331.264 bytes
7zip 35.151.362 bytes 19 min
bcm 001 31.927.387 bytes 22 min
bcm 002 31.022.446 bytes 17 min
bliz 0.24
bliz blocksize 100000000 -- 32.504.413 bytes 8 min
bliz blocksize 16800000 -- 31.503.372 bytes 7 min
bliz blocksize 33600000 -- 31.850.304 bytes 7,5 min
@christian
Thank you!
You are right:
with the lower blocksize of 16800000, compression is faster and better.
i understand:
"Regarding the archiver functionality, I'm sorry. But it just takes too much time."
but i think that in the future, if we want to use such a new compressor for real purposes, we need support for directories including subdirectories.
...thinking...
maybe it would be possible to create a compressor-independent archive format, like "*.ISO = CD-image format", which contains the directory structure, with all files within this structure stored in a compressed format.
what do you think about this?
- maybe another programmer could implement such a compressor-independent archive format
- would you use such an independent archive format? it would be usable for your other wonderful compressors like ccm or slug too...
- or would you tend to contribute some compression code for such an implementation?
- maybe dual-license (commercial) or open-source?
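To make the suggestion above a bit more concrete, here is a purely hypothetical sketch of what the metadata of such a compressor-independent container could look like. None of these names or fields belong to an existing format; the payloads themselves would be produced by whatever external compressor is named in each entry.
Code:
#include <cstdint>
#include <string>
#include <vector>

// One file entry in the hypothetical container.
struct EntryHeader {
    std::string path;           // relative path inside the archived directory tree
    uint64_t    original_size;  // size before compression
    uint64_t    stored_size;    // size of the compressed payload in the container
    std::string compressor;     // free-form ID, e.g. "bliz 0.24", "ccm", "slug"
    uint32_t    crc32;          // integrity check of the original data
};

// Overall layout: magic, then the compressed payloads back to back,
// then a central directory of EntryHeader records at the end.
struct Container {
    static constexpr char magic[9] = "IARCHV01";   // invented identifier
    std::vector<EntryHeader> entries;
};
Since each entry only names its compressor, any stand-alone (de)compressor with a file or stdin/stdout interface could fill in the payloads.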
Thank you for your wonderful compressor programs!
Do you have any plans in this direction for the future?
Thanks a lot Przemyslaw!
Still, MFC favors CCM over LPAQ because of its data filters. I don't know about PPMonstr, though.
I'm thinking about it. But it's also a matter of time. And replacing PPMd with CCM is no win-win situation. PPMd beats CCM on some files like "world95.txt".
@Joerg:
Currently, I don't have any time for a new project. But this might change in a few months. So, once everything has settled in again, I might release some of my work as OSS.
my own suggestion list:
1) implement stdin/stdout interfaces and win32/linux versions for all your compressors. this makes it very easy to use them inside full-scale archivers (see the sketch after this list)
2) remove the 2 GB limit in rzm. it's really disappointing, since rzm has a beautiful speed/compression ratio
3) if rzm allows increasing the dictionary, and the dictionary size can be set independently from the hash size, it will be possible to outperform winrk
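Regarding point 1, a minimal sketch of such a stdin/stdout filter skeleton (just a pass-through copy here, with the actual compression call left as a comment; the only platform-specific detail is switching the Windows standard streams to binary mode):
Code:
#include <cstdio>
#include <vector>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

// Minimal stdin -> stdout filter skeleton. A compressor exposing this
// interface can be dropped straight into an archiver's pipeline.
int main() {
#ifdef _WIN32
    // On Windows, stdio defaults to text mode and would corrupt binary data.
    _setmode(_fileno(stdin),  _O_BINARY);
    _setmode(_fileno(stdout), _O_BINARY);
#endif
    std::vector<unsigned char> buf(1 << 16);
    size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), stdin)) > 0) {
        // real code would compress/decompress the chunk here
        if (std::fwrite(buf.data(), 1, n, stdout) != n) return 1;
    }
    return 0;
}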
ppmonstr doesn't include filters; durilca is ppmonstr+filters, while durilca'light is ppmd+filters
Interesting, because MFC is over 400 files in 30+ formats, and CCM gets all the files mixed into a single TAR archive as input.
Still, CCM is better on average than LPAQ on:
http://www.squeezechart.com/main.html
http://www.winturtle.netsons.org/MOC/MOC.htm
My conclusion is that LPAQ is better than CCM on TXT and EXE because of its filters, or because LPAQ is tuned for SFC.
True, and this is a place for improvement. You should detect textual files (which is quite simple) and use a different algorithm (models, weights, mixers), just like lpaq2+ does.
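For illustration, a crude detector of the kind hinted at here might just sample a block and count text-like bytes. The sample size and threshold below are arbitrary guesses, not lpaq2's or CCM's actual heuristic.
Code:
#include <cstddef>
#include <cstdint>

// Crude text detector: measure the share of bytes in a sample that look
// like text (printable ASCII, tab, CR, LF) and switch to text-tuned
// models/weights when the share is high enough.
static bool looks_like_text(const uint8_t* data, size_t len) {
    const size_t sample = len < 65536 ? len : 65536;
    if (sample == 0) return false;
    size_t texty = 0;
    for (size_t i = 0; i < sample; ++i) {
        const uint8_t c = data[i];
        if ((c >= 0x20 && c < 0x7F) || c == '\t' || c == '\r' || c == '\n')
            ++texty;
    }
    return texty * 10 > sample * 9;   // e.g. >90% text-like bytes
}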
CCM does data filtering based on content, so the TAR archive does not matter.
Yes, that's the way I'd do it, too. But at the time I wrote CCM, I was obsessed with using only one simple algorithm for everything.
And then I moved on to other things like Slug, RZM and now Bliz. But I'm sure that I'd be able to improve CCM a bit further without losing speed. Nonetheless, a rewrite of an improved CCM would take a long time. So, for the time being, I stick to lightweight projects like Slug and Bliz. They can be finished in several hours and are still fun (as long as I don't add archiver functionality).
0.24b is faster and better!
Quote:
Still, CCM is better on average than LPAQ on:
http://www.squeezechart.com/main.html
http://www.winturtle.netsons.org/MOC/MOC.htm
Both contain MM data. You may find an MM-cleaned comparison of ccmx 1.23 and lpaq1 at http://freearc.org/Maximal-Practical-Compression.aspx
my own comparison on text data:
Code:
7-zip 4.58     6.847  0.239  12.058
rzm 0.07h      7.366  0.122   3.588
uharc -mx      7.404  0.239   0.262
freearc -mx    7.699  0.514   0.990
ccmx 1.30      8.098  0.196
freearc -max   8.374  0.223   0.287
lpaq8          8.705  0.058
on mainly binary data:
Code:
7-zip 4.58     5.133  0.292  10.259
freearc -mx    5.252  0.340   1.891
uharc -mx      5.322  0.168   0.191
freearc -max   5.407  0.236   0.589
rzm 0.07h      5.430  0.139   2.578
ccmx 1.30      5.448  0.217
lpaq8          5.684  0.059
First column is the compression ratio; second and third are compression and decompression speed.
because "mainly binary data" includes, may be, 20% of texts. actually, this compression profile is rather close to that of squeeze chart or mfc
as anticipated, the results of bliz 0.24b on the file db.dmp are the same as with bliz 0.24
Compression speed test on Maniscalco's Corpus:
Code:
Timing results (in secs, blocksize = default)
                 bzip2   bliz24b   bcm002a
abac              0.15      0.04     45.81
abba             11.21   2144.79  11303.42
book1x20          2.59      0.54  24616.59
fib_s14930352    15.07      0.07  20739.16
fss9              2.84      0.03   give up
fss10            12.01      0.09
houston           3.31      0.06
paper5x80         0.89      0.01
test1             1.89      0.01
test2             1.89      0.03
test3             2.04      0.03
bliz 0.24b is very fast, excluding 'abba'.
The issue is caused not by the preprocessors themselves, but by lack of memory.
FreeArc compression algorithms are actually a chain of 2 to 4 algorithms: the first one is applied to the data, the second one is applied to the output of the first, and so on; the output of the last algorithm is written to the archive.
In -mx mode, some algorithms use too much memory and thus have to be applied sequentially: the first algorithm's output is written to a temporary file, then it is read by the second, and so on. That's too slow. So, the -mx mode is usually impractical.
EXE (BCJ) & Delta are definitely not the reason: they use too little memory. Probably it's the REP.
And text shouldn't be the reason, because text usually compresses faster than binaries.
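As a sketch of the chaining idea described above (not FreeArc's actual code; stage names and types are placeholders): each stage maps the previous stage's output buffer to a new buffer, and when everything fits in RAM the whole chain runs back to back in memory instead of spilling intermediate results to temporary files.
Code:
#include <cstdint>
#include <functional>
#include <vector>

// A stage transforms one buffer into the next (e.g. rep -> delta -> exe -> lzma).
using Stage = std::function<std::vector<uint8_t>(const std::vector<uint8_t>&)>;

// Run the whole chain in memory; the final buffer goes to the archive.
static std::vector<uint8_t> run_chain(std::vector<uint8_t> data,
                                      const std::vector<Stage>& chain) {
    for (const Stage& stage : chain)
        data = stage(data);
    return data;
}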