I found an interesting test set.
OpenOffice documents are zip archives holding several files, mostly XML. I wanted to know whether zip is a good tool for the job, and the results turned out to be quite interesting.
Data: a spreadsheet of archiver test results with some charts. Original size: 3 185 340 B. Zipped by OO: 210 481 B. Very redundant...
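Since an OO document is just a zip, its members and their sizes can be listed with any zip library. Here's a minimal sketch using Python's standard zipfile module. The demo archive is built in memory with made-up contents so the snippet runs on its own; pointing zip_member_sizes at a real .ods path should work the same way.

```python
import io
import zipfile


def zip_member_sizes(source):
    """Return (name, original_bytes, compressed_bytes) for each zip member.

    `source` may be a path or a file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return [(info.filename, info.file_size, info.compress_size)
                for info in archive.infolist()]


def build_demo_archive():
    """Build a small in-memory zip with one redundant XML member (made-up data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("content.xml", "<row><cell>1</cell></row>\n" * 2000)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, original, packed in zip_member_sizes(build_demo_archive()):
        print(f"{name}: {original} B -> {packed} B")
```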
The test set is small, so timing is not very accurate, especially for the fast compressors.
I can't measure how fast OO compresses the file. I could use a stopwatch, but that would be very inaccurate. Also, generating the XML files adds to the saving time and I have no idea by how much. The closest thing I could do was to create a zip with 7z.
Code:
Archiver Size (B) Time (s)
7z -tzip -mx=1 226850 0.187
7z -tzip -mx=3 226850 0.203
7z -tzip -mx=5 216749 0.640
7z -tzip -mx=7 208253 1.203
7z -tzip -mx=9 195038 5.437
OO seems to use something close to 7z -mx=7, but a bit weaker, so it probably takes 0.8-1 s.
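With timings this short, one way to cut the noise is to run each command several times and keep the fastest run. This is only a sketch of that idea, not how the numbers above were taken. The command in the demo is a stand-in so it runs anywhere, and the file names in the comment are hypothetical.

```python
import subprocess
import sys
import time


def best_time(command, repeats=5):
    """Run `command` several times; return the shortest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(command, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Stand-in command. For a real test, swap in the archiver invocation, e.g.
    # ["7z", "a", "-tzip", "-mx=5", "out.zip", "input_dir"] (hypothetical names).
    print(f"{best_time([sys.executable, '-c', 'pass']):.3f} s")
```

Note that 7z a updates an archive that already exists, so the output file would have to be deleted between runs. Process startup is also included in the measured time.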
How good can zip be? I tried kzip, which claims to generate zips 1-3% smaller than PKZIP.
Code:
Archiver Size (B) Time (s)
kzip /s0 /b0 189783 42.516
kzip /s0 /b128 194807 53.640
kzip /s0 /b256 189461 57.406
kzip /s0 /b512 187122 50.406
kzip /s0 /b1024 187816 46.641
kzip /s1 /b0 190463 33.406
kzip /s1 /b128 195240 57.562
kzip /s1 /b256 189862 43.656
kzip /s1 /b512 187444 42.593
kzip /s1 /b1024 188183 38.032
kzip /s2 /b0 319485 0.688
kzip /s2 /b128 325280 1.937
kzip /s2 /b256 319849 1.593
kzip /s2 /b512 317122 1.281
kzip /s2 /b1024 317677 1.125
kzip /s3 /b0 1787918 0.578
kzip /s3 /b128 1763812 1.812
kzip /s3 /b256 1771326 1.468
kzip /s3 /b512 1779670 1.156
kzip /s3 /b1024 1782342 1.015
Very slow, but the size got down to 187122 B: 11% smaller than the OO zip and 4% smaller than 7z -mx=9. That's a very small gain.
Now the other compressors. The best results:
Code:
Archiver Size (B) Time (s)
FastLZ opt -2 370534 0.015
FastLZ -2 365344 0.016
quick -0 311703 0.031
slug 198995 0.046
NanoZip -cd 186876 0.093
4x4 1t 123047 0.171
4x4 2t 122051 0.234
4x4 4t 121034 0.296
FreeArc -m4 -ms 117940 0.875
FreeArc -m5 -ms 116319 1.421
FreeArc -m5 115312 1.437
FreeArc -m7 115305 1.453
CCM 0 108787 1.578
CCM 1 108579 1.609
CCM 2 108481 1.735
CCM 3 108433 1.860
CCMX 0 106744 2.031
CCMX 1 106225 2.094
CCMX 2 105849 2.250
CCMX 3 105634 2.437
FreeArc -max -ms 95661 2.515
FreeArc -max 94654 2.531
FreeArc -max -ma- 86912 3.734
NanoZip -cc 84755 16.297
PAQ8p -1 83124 66.625
PAQ8p -2 81566 67.578
PAQ8p -3 81040 68.375
PAQ8p -4 51999 499.015
PAQ8p -5 50780 501.672
PAQ8p -6 50046 513.891
PAQ8p -7 49870 538.828
FastLZ needs just 0.015 s, which is over 200 MB/s. I/O is definitely cached by the OS.
Slug produces a smaller file than OO while being roughly 15 times faster.
4x4 1t almost halves the OO result and is 5 times faster (!!!).
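The arithmetic behind these remarks is easy to redo. A small sketch, reading the OO zip size as 210 481 B and taking sizes and times from the table above:

```python
ORIGINAL_SIZE = 3_185_340   # bytes, uncompressed spreadsheet contents
OO_ZIP_SIZE = 210_481       # bytes, the file as saved by OO


def throughput_mb_per_s(seconds, size=ORIGINAL_SIZE):
    """Compression speed in MB/s (1 MB = 10^6 B) for the whole test set."""
    return size / seconds / 1_000_000


def relative_size(size, reference=OO_ZIP_SIZE):
    """Archive size as a fraction of the reference size."""
    return size / reference


if __name__ == "__main__":
    print(f"FastLZ opt -2: {throughput_mb_per_s(0.015):.0f} MB/s")
    print(f"slug vs OO:    {relative_size(198_995):.2f}")
    print(f"4x4 1t vs OO:  {relative_size(123_047):.2f}")
```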
Then there's nothing really interesting until the jump between PAQ8p -3 and -4. I tested several times and there were no memory issues; -4 really takes that long, and it does decompress its own output. I tried to investigate: it seems to get consistent gains on almost all files and is always equally slow.
But there's another thing. Let's calculate efficiency as maximumcompression.com does:
Code:
Archiver Size (B) Time (s) Efficiency(maximumcompression.com)
FastLZ opt -2 370534 0.015 340654353845826000.0
FastLZ -2 365344 0.016 176627773533093000.0
quick -0 311703 0.031 197866418022820.0
slug 198995 0.046 46172317.6
NanoZip -cd 186876 0.093 17320813.6
4x4 1t 123047 0.171 4468.6
4x4 2t 122051 0.234 5324.4
4x4 4t 121034 0.296 5847.4
FreeArc -m4 -ms 117940 0.875 11243.8
FreeArc -m5 -ms 116319 1.421 14576.4
FreeArc -m5 115312 1.437 12815.3
FreeArc -m7 115305 1.453 12945.4
CCM 0 108787 1.578 5682.1
CCM 1 108579 1.609 5628.6
CCM 2 108481 1.735 5987.3
CCM 3 108433 1.860 6376.0
CCMX 0 106744 2.031 5505.4
CCMX 1 106225 2.094 5281.2
CCMX 2 105849 2.250 5385.7
CCMX 3 105634 2.437 5661.5
FreeArc -max -ms 95661 2.515 1460.9
FreeArc -max 94654 2.531 1278.2
FreeArc -max -ma- 86912 3.734 642.9
NanoZip -cc 84755 16.297 2079.1
PAQ8p -1 83124 66.625 6775.6
PAQ8p -2 81566 67.578 5534.4
PAQ8p -3 81040 68.375 5204.9
PAQ8p -4 51999 499.015 670.9
PAQ8p -5 50780 501.672 569.3
PAQ8p -6 50046 513.891 526.6
PAQ8p -7 49870 538.828 538.8
Ladies and gentlemen, welcome the new efficiency king, PAQ8p. Who cares that it gets 6 KB/s, what a great size! I wonder why the OO team didn't choose to use it; maybe we should suggest it to them?
I know that some uses might be more sensitive to file size and less to speed than office documents, but that's just ridiculous.
I've been thinking about a different measure of efficiency for some time, and now it's time to show my take on the topic.
1. Copying is usually a very viable method of archiving, much more so than PAQ. And IMO this is what archivers should be compared to.
2. Extreme slowness = 0 usefulness = 0 score.
3. Using the minimal size as the reference is wrong. If I were looking for something under 0.1 s on this test, I couldn't care less about PAQ scores, yet the fact that I tested it changed the ranking. Remove everything slower than 0.15 s and slug wins; include it and NanoZip is better. You always have to recalculate everything for your own time boundaries.
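Point 3 can be checked numerically. The maximumcompression.com column in the table above is reproduced by time * 2^(10 * (size / smallest_size - 1)), where lower is better. That formula and the factor of 10 are fitted to the posted numbers, not quoted from the site, so treat them as an assumption. The sketch below ranks the five fastest entries alone and then again with one PAQ8p row added.

```python
def mc_score(size, seconds, min_size, exponent_scale=10):
    """Score fitted to the table: time * 2^(scale * (size / min_size - 1)).

    Lower is better. The scale of 10 is inferred from the posted numbers.
    """
    return seconds * 2 ** (exponent_scale * (size / min_size - 1))


def rank(results):
    """Sort (name, size, seconds) rows by score, best (lowest) first."""
    min_size = min(size for _, size, _ in results)
    return sorted(results, key=lambda row: mc_score(row[1], row[2], min_size))


FAST = [
    ("FastLZ opt -2", 370534, 0.015),
    ("FastLZ -2", 365344, 0.016),
    ("quick -0", 311703, 0.031),
    ("slug", 198995, 0.046),
    ("NanoZip -cd", 186876, 0.093),
]
SLOW = [("PAQ8p -7", 49870, 538.828)]

if __name__ == "__main__":
    print("best of the fast ones alone:", rank(FAST)[0][0])
    names = [name for name, _, _ in rank(FAST + SLOW)]
    print("NanoZip ahead of slug once PAQ is included:",
          names.index("NanoZip -cd") < names.index("slug"))
```

The ranking of slug against NanoZip flips only because the smallest size in the set changed, which is exactly the problem with using it as the reference.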
My proposal is: POWER(10;1/10)/((size/original_size)*LOG(size/original_size+1;2)*POWER(POWER(10;1/10);time/time_of_copying)) (A bit unreadable, but I won't learn tech to show it better.)
The higher the score, the better. XCOPY gets 1. I call archivers that score at least that much "practical".
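For anyone who'd rather read code than a spreadsheet formula, here is a sketch of the same expression in Python. The post doesn't state the copy time; a value of about 0.141 s reproduces the posted scores, so it is used below as an inferred figure, not a measured one. The time term is computed through exp() so that very slow entries underflow to 0 instead of overflowing.

```python
import math

ORIGINAL_SIZE = 3_185_340  # bytes
COPY_TIME = 0.141          # seconds; inferred from the posted scores, not measured


def proposed_score(size, seconds,
                   original_size=ORIGINAL_SIZE, copy_time=COPY_TIME):
    """Proposed efficiency: higher is better, a plain copy scores exactly 1."""
    ratio = size / original_size
    size_term = ratio * math.log2(ratio + 1)
    # POWER(10;1/10) / POWER(POWER(10;1/10); t/tc) == 10^((1 - t/tc) / 10)
    time_term = math.exp(math.log(10) / 10 * (1 - seconds / copy_time))
    return time_term / size_term


def is_practical(size, seconds):
    """'Practical' means scoring at least as well as copying."""
    return proposed_score(size, seconds) >= 1.0


if __name__ == "__main__":
    print(f"copy:      {proposed_score(ORIGINAL_SIZE, COPY_TIME):.2f}")
    print(f"4x4 1t:    {proposed_score(123047, 0.171):.2f}")
    print(f"PAQ8p -7 practical? {is_practical(49870, 538.828)}")
```

Every extra multiple of ten copy times costs a factor of 10 in score, so anything much slower than a few seconds on this test ends up at effectively zero.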
Results:
Code:
Archiver Size (B) Time (s) Efficiency(proposed) Efficiency(maximumcompression.com)
FastLZ opt -2 370534 0.015 66.52 340654353845826000.0
FastLZ -2 365344 0.016 68.26 176627773533093000.0
quick -0 311703 0.031 90.80 197866418022820.0
slug 198995 0.046 213.82 46172317.6
NanoZip -cd 186876 0.093 224.14 17320813.6
4x4 1t 123047 0.171 450.79 4468.6
4x4 2t 122051 0.234 413.32 5324.4
4x4 4t 121034 0.296 379.77 5847.4
FreeArc -m4 -ms 117940 0.875 155.30 11243.8
FreeArc -m5 -ms 116319 1.421 65.44 14576.4
FreeArc -m5 115312 1.437 64.86 12815.3
FreeArc -m7 115305 1.453 63.20 12945.4
CCM 0 108787 1.578 57.83 5682.1
CCM 1 108579 1.609 55.18 5628.6
CCM 2 108481 1.735 45.00 5987.3
CCM 3 108433 1.860 36.72 6376.0
CCMX 0 106744 2.031 28.66 5505.4
CCMX 1 106225 2.094 26.10 5281.2
CCMX 2 105849 2.250 20.38 5385.7
CCMX 3 105634 2.437 15.08 5661.5
FreeArc -max -ms 95661 2.515 16.16 1460.9
FreeArc -max 94654 2.531 16.08 1278.2
FreeArc -max -ma- 86912 3.734 2.67 642.9
NanoZip -cc 84755 16.297 0.00 2079.1
PAQ8p -1 83124 66.625 0.00 6775.6
PAQ8p -2 81566 67.578 0.00 5534.4
PAQ8p -3 81040 68.375 0.00 5204.9
PAQ8p -4 51999 499.015 0.00 670.9
PAQ8p -5 50780 501.672 0.00 569.3
PAQ8p -6 50046 513.891 0.00 526.6
PAQ8p -7 49870 538.828 0.00 538.8
There's one more interesting thing.
Code:
Archiver Size (B) Time (s) Efficiency(proposed)
PAQ9a 1 98727 3.140 5.47
PAQ9a 2 97583 3.062 6.36
PAQ9a 3 97310 3.094 6.07
PAQ9a 4 97137 3.187 5.23
PAQ9a 5 96795 3.359 6.36
PAQ9a 6 97112 3.625 6.07
PAQ9a 7 97527 4.063 5.23
PAQ9a 8 98465 4.953 3.98
PAQ9a is the first and the only PAQ that's practical. Congratulations, no other (L)PAQ tested even came close.
That's because LZP greatly reduces the amount of data the CM stage has to process, right?
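As a cross-check, the PAQ9a scores can be recomputed from the sizes and times in the table. This sketch uses the proposed formula with a copy time of about 0.141 s, which is inferred from the main results table and not stated anywhere in the post.

```python
import math

ORIGINAL_SIZE = 3_185_340  # bytes
COPY_TIME = 0.141          # seconds; inferred from the posted scores, not measured

# (level, size in bytes, time in seconds, efficiency as posted)
PAQ9A = [
    (1, 98727, 3.140, 5.47),
    (2, 97583, 3.062, 6.36),
    (3, 97310, 3.094, 6.07),
    (4, 97137, 3.187, 5.23),
    (5, 96795, 3.359, 6.36),
    (6, 97112, 3.625, 6.07),
    (7, 97527, 4.063, 5.23),
    (8, 98465, 4.953, 3.98),
]


def proposed_score(size, seconds):
    """Proposed efficiency; the time term equals 10^((1 - t/tc) / 10)."""
    ratio = size / ORIGINAL_SIZE
    size_term = ratio * math.log2(ratio + 1)
    return math.exp(math.log(10) / 10 * (1 - seconds / COPY_TIME)) / size_term


def recompute(rows=PAQ9A):
    """Return (level, posted_score, recomputed_score) for each row."""
    return [(level, posted, proposed_score(size, seconds))
            for level, size, seconds, posted in rows]


if __name__ == "__main__":
    for level, posted, recomputed in recompute():
        print(f"PAQ9a {level}: posted {posted:.2f}, recomputed {recomputed:.2f}")
```

Under that assumption, levels 1 to 4 match the table. Levels 5 to 7 in the table repeat the values of levels 2 to 4, and the 3.98 shown for level 8 is what the formula gives for level 5, so the end of the Efficiency column may have been pasted out of step. The recomputed score for level 8 comes out below 1.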
P.S.:
I write "Efficiency(maximumcompression.com)" because maximumcompression.com is the most popular site that uses this function; I don't know and don't care who came up with it.
EDIT:
I forgot to attach the results.