Hi compression gurus,
Which algorithms offer the best compression rate for compressing small text(length < 2000 bytes)?
Any link to java open source code is highly appreciated as well.
Thanks
If the files have something in common then a pre-trained context mixing compressor should obtain top compression ratio.
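If you want something in plain Java right away, java.util.zip can exploit the same idea of shared content at a much simpler level: Deflater and Inflater support a preset dictionary built from text that is typical for your files. Below is only a rough sketch of that approach; the class name, dictionary and sample text are made up, and zlib with a dictionary is of course nowhere near context mixing.
Code:
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PresetDictDemo {
    // Hypothetical dictionary: phrases that tend to appear in the small texts.
    static final byte[] DICT =
        "the quick brown fox jumps over the lazy dog".getBytes(StandardCharsets.UTF_8);

    static byte[] compress(byte[] data) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setDictionary(DICT);                  // preset dictionary = shared context
        def.setInput(data);
        def.finish();
        byte[] buf = new byte[data.length + 64];  // plenty for small inputs
        int n = def.deflate(buf);
        def.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    static byte[] decompress(byte[] comp, int origLen) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(comp);
        byte[] out = new byte[origLen];
        int n = inf.inflate(out);
        if (n == 0 && inf.needsDictionary()) {    // zlib stream asks for the dictionary
            inf.setDictionary(DICT);
            inf.inflate(out);
        }
        inf.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] text = "the quick brown fox was lazy today".getBytes(StandardCharsets.UTF_8);
        byte[] c = compress(text);
        System.out.println(text.length + " -> " + c.length + " bytes");
        System.out.println(new String(decompress(c, text.length), StandardCharsets.UTF_8));
    }
}
The dictionary is not stored in the compressed stream, so compressor and decompressor must agree on it; the more representative it is of your texts, the bigger the gain on inputs under 2000 bytes.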
Input RAM-disk:
898 bytes - text file
Output RAM-disk:
1,520 bytes, 0.075 sec. - 0.041 sec., WinZpaq - 6
1,146 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
779 bytes, 0.005 sec. - 0.005 sec., Qpress - 3
765 bytes, 0.006 sec. - 0.006 sec., LZ4 - 2
746 bytes, 0.055 sec. - 0.050 sec., FreeArc - 9
690 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
625 bytes, 0.010 sec. - 0.005 sec., lzturbo - 49
622 bytes, 0.014 sec. - 0.012 sec., WinRAR - 5
621 bytes, 0.020 sec. - 0.018 sec., ZCM - 7
600 bytes, 0.107 sec. - 0.102 sec., NanoZip - c
586 bytes, 0.124 sec. - 0.008 sec., Bsc - 8
Input RAM-disk:
1,480 bytes - text file
Output RAM-disk:
1,766 bytes, 0.077 sec. - 0.042 sec., WinZpaq - 6
1,424 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
1,201 bytes, 0.006 sec. - 0.005 sec., LZ4 - 2
1,148 bytes, 0.005 sec. - 0.004 sec., Qpress - 3
996 bytes, 0.055 sec. - 0.050 sec., FreeArc - 9
983 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
933 bytes, 0.010 sec. - 0.005 sec., lzturbo - 49
933 bytes, 0.021 sec. - 0.020 sec., ZCM - 7
913 bytes, 0.013 sec. - 0.011 sec., WinRAR - 5
872 bytes, 0.107 sec. - 0.102 sec., NanoZip - c
854 bytes, 0.125 sec. - 0.007 sec., Bsc - 8
Input RAM-disk:
3,199 bytes - text file
Output RAM-disk:
2,418 bytes, 0.081 sec. - 0.046 sec., WinZpaq - 6
2,350 bytes, 0.006 sec. - 0.005 sec., LZ4 - 2
2,170 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
2,167 bytes, 0.005 sec. - 0.004 sec., Qpress - 3
1,759 bytes, 0.024 sec. - 0.023 sec., ZCM - 7
1,753 bytes, 0.011 sec. - 0.005 sec., lzturbo - 49
1,747 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
1,691 bytes, 0.013 sec. - 0.011 sec., WinRAR - 5
1,657 bytes, 0.055 sec. - 0.050 sec., FreeArc - 9
1,560 bytes, 0.122 sec. - 0.008 sec., Bsc - 8
1,559 bytes, 0.110 sec. - 0.104 sec., NanoZip - c
Input RAM-disk:
7,626 bytes - text file
Output RAM-disk:
4,782 bytes, 0.006 sec. - 0.005 sec., LZ4 - 2
4,492 bytes, 0.005 sec. - 0.004 sec., Qpress - 3
3,858 bytes, 0.543 sec. - x.xxx sec., eXdupe - 3
3,848 bytes, 0.089 sec. - 0.057 sec., WinZpaq - 6
3,585 bytes, 0.031 sec. - 0.030 sec., ZCM - 7
3,533 bytes, 0.012 sec. - 0.006 sec., lzturbo - 49
3,477 bytes, 0.014 sec. - 0.012 sec., WinRAR - 5
3,449 bytes, 0.009 sec. - 0.006 sec., 7-Zip - 9
3,150 bytes, 0.124 sec. - 0.008 sec., Bsc - 8
3,104 bytes, 0.056 sec. - 0.051 sec., FreeArc - 9
3,006 bytes, 0.114 sec. - 0.111 sec., NanoZip - c
All one thread.
Thanks Sportman.
Is there any chance of finding a Java implementation of NanoZip or BSC?
Thanks
I don't think there are Java sources, but you can ask:
http://libbsc.com
http://www.nanozip.net
Because the headers of popular archivers can be big and have a big impact on small test files, I also tested some less popular (zip excluded) compressors:
Input:
898 bytes, text file
Output:
995 bytes, zlite
919 bytes, kwc
803 bytes, mtcc
696 bytes, sharc 1
671 bytes, crush x
661 bytes, rar zip max
640 bytes, packet
501 bytes, bcm
470 bytes, ppmx
466 bytes, paq8pxd7 8
Input:
1,480 bytes, text file
Output:
1,501 bytes, kwc
1,299 bytes, mtcc
1,293 bytes, zlite
1,136 bytes, sharc 1
1,028 bytes, crush x
946 bytes, rar zip max
941 bytes, packet
769 bytes, bcm
727 bytes, ppmx
702 bytes, paq8pxd7 8
Input:
3,199 bytes, text file
Output:
3,203 bytes, kwc
2,744 bytes, mtcc
2,369 bytes, sharc 1
2,122 bytes, zlite
1,955 bytes, crush x
1,755 bytes, packet
1,715 bytes, rar zip max
1,485 bytes, bcm
1,407 bytes, ppmx
1,326 bytes, paq8pxd7 8
Input:
7,626 bytes, text file
Output:
7,349 bytes, kwc
6,499 bytes, mtcc
5,340 bytes, sharc 1
4,021 bytes, zlite
3,956 bytes, crush x
3,560 bytes, packet
3,492 bytes, rar zip max
3,072 bytes, bcm
2,916 bytes, ppmx
2,712 bytes, paq8pxd7 8
With smaller text files, to see the header size influence better:
Input RAM-disk:
346 bytes, text file
Output RAM-disk:
1,233 bytes, 0.067 sec. - 0.037 sec., WinZpaq - 6
871 bytes, 0.542 sec. - x.xxx sec., eXdupe - 3
499 bytes, 0.053 sec. - 0.047 sec., FreeArc - 9
425 bytes, 0.004 sec. - 0.003 sec., Qpress - 3
396 bytes, 0.007 sec. - 0.005 sec., 7-Zip - 9
344 bytes, 0.018 sec. - 0.015 sec., ZCM - 7
341 bytes, 0.004 sec. - 0.004 sec., LZ4 - 2
338 bytes, 0.011 sec. - 0.010 sec., WinRAR - 5
315 bytes, 0.106 sec. - 0.096 sec., NanoZip - c
313 bytes, 0.009 sec. - 0.005 sec., lzturbo - 49
312 bytes, 0.027 sec. - 0.006 sec., Bsc - 6
Input:
346 bytes, text file
Output:
714 bytes, zlite
390 bytes, rar zip max
370 bytes, kwc
368 bytes, packet
317 bytes, crush x
314 bytes, mtcc
300 bytes, sharc 1
232 bytes, paq8pxd7 8
225 bytes, bcm
221 bytes, ppmx
Input RAM-disk:
186 bytes, text file
Output RAM-disk:
1,194 bytes, 0.066 sec. - 0.035 sec., WinZpaq - 6
768 bytes, 0.652 sec. - x.xxx sec., eXdupe - 3
422 bytes, 0.053 sec. - 0.048 sec., FreeArc - 9
298 bytes, 0.007 sec. - 0.006 sec., 7-Zip - 9
259 bytes, 0.004 sec. - 0.004 sec., Qpress - 3
233 bytes, 0.012 sec. - 0.010 sec., WinRAR - 5
232 bytes, 0.025 sec. - 0.005 sec., Bsc - 6
225 bytes, 0.018 sec. - 0.014 sec., ZCM - 7
216 bytes, 0.104 sec. - 0.097 sec., NanoZip - c
213 bytes, 0.009 sec. - 0.005 sec., lzturbo - 49
202 bytes, 0.004 sec. - 0.004 sec., LZ4 - 2
Input:
186 bytes, text file
Output:
623 bytes, zlite
297 bytes, rar zip max
240 bytes, packet
208 bytes, kwc
201 bytes, crush x
186 bytes, sharc 1
174 bytes, mtcc
156 bytes, paq8pxd7 8
144 bytes, ppmx
137 bytes, bcm
You could try ccm or ccmx too. Quite old (you can find them in this forum) but still often the 'best'.
For small text compression it's hard to beat BWTS followed by simple RLE ARI bijective type of things.
I usually showed how the file "BANANAS" compresses to smaller than 7 bytes, which is hard to do with
most compressors. Even the "SCOTT" transform followed by bijective LZW does pretty well. All
these programs are on this site somewhere.
186 bytes to 151 bytes, ccmx 7
346 bytes to 241 bytes, ccmx 7
898 bytes to 547 bytes, ccmx 7
1,480 bytes to 872 bytes, ccmx 7
3,199 bytes to 1,665 bytes, ccmx 7
7,626 bytes to 3,370 bytes, ccmx 7
186 bytes to 144 bytes, cmm 43
346 bytes to 244 bytes, cmm 43
898 bytes to 534 bytes, cmm 43
1,480 bytes to 799 bytes, cmm 43
3,199 bytes to 1,499 bytes, cmm 43
7,626 bytes to 2,996 bytes, cmm 43
Some results on enwik3 (first 1000 bytes of enwik9).
Size differences for small files are mainly due to archive overhead, so single file compressors do better than archivers. zpaq journaling archive format has a lot of overhead for versioning, deduplication and error recovery. Option -m s0.0c3.0.255 specifies streaming format (s), 1 MB block size (0), no preprocessing (0), a direct context model (c) with fast adaptation (3 = max count), no special contexts (0), and order 1 (255). -fragile removes header locator tag and checksum to save 23 bytes. There is still the overhead of the block header with the algorithm description and segment header with the file name and size, but the algorithm is much simpler than LZ77 (-m 1, 2, 3), BWT (4), or multiple CM (5, 6).
Code:
302 enwik3.pmm (ppmonstr)
314 enwik3-7.paq8px
320 enwik3-o8-m256-r1.pmd
322 enwik3-d6-m16M-f16M.ctw
343 enwik3-9.gz
354 enwik3-7.ccmx
355 enwik3-7.ccm
377 enwik3-c1000000000.dmc
391 enwik3-9.bz2
466 enwik3-mx.7z
506 enwik3-ms0.0c3.0.255-fragile.zpaq
530 enwik3.Z
570 enwik3-9.zip
628 enwik3.fpaq0
1,000 enwik3
1,133 enwik3-m5.zpaq
1,266 enwik3-m4.zpaq
1,306 enwik3-m3.zpaq
1,321 enwik3-m6.zpaq
1,488 enwik3-m2.zpaq
1,503 enwik3-m1.zpaq
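The same archive-container overhead is easy to see from plain Java, which ties back to the original question. A rough sketch of my own (made-up sample text, illustrative class name) comparing a raw deflate stream against the gzip and zip containers for one small input:
Code:
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

public class ContainerOverhead {
    public static void main(String[] args) throws Exception {
        // Made-up small text input
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) sb.append("small text with some repeated words, ");
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        // Raw deflate: no container at all
        Deflater raw = new Deflater(Deflater.BEST_COMPRESSION, true);
        raw.setInput(data);
        raw.finish();
        byte[] buf = new byte[data.length + 64];
        int rawLen = raw.deflate(buf);
        raw.end();

        // gzip: the same kind of deflate data plus a fixed header and trailer
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(gz)) { out.write(data); }

        // zip: adds a local file header, central directory and end record
        ByteArrayOutputStream zip = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(zip)) {
            out.putNextEntry(new ZipEntry("small.txt"));
            out.write(data);
            out.closeEntry();
        }

        System.out.println("input: " + data.length + ", raw deflate: " + rawLen
                + ", gzip: " + gz.size() + ", zip: " + zip.size());
    }
}
The raw stream carries no file name, checksum or directory, so on sub-kilobyte inputs the container can be a noticeable fraction of the output, just as in the table above.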
Input RAM-disk:
1,000 bytes, enwik3
Output RAM-disk:
1,321 bytes, 0.072 sec. - 0.039 sec., WinZpaq - 6
972 bytes, 0.542 sec. - x.xxx sec., eXdupe - 3
577 bytes, 0.057 sec. - 0.050 sec., FreeArc - 9
533 bytes, 0.022 sec. - 0.018 sec., ZCM - 7
493 bytes, 0.005 sec. - 0.005 sec., Qpress - 3
466 bytes, 0.008 sec. - 0.006 sec., 7-Zip - 9
463 bytes, 0.109 sec. - 0.103 sec., NanoZip - c
453 bytes, 0.006 sec. - 0.006 sec., LZ4 - 2
430 bytes, 0.028 sec. - 0.008 sec., Bsc - 6
409 bytes, 0.014 sec. - 0.011 sec., WinRAR - 5
405 bytes, 0.011 sec. - 0.006 sec., lzturbo - 49
Input:
1,000 bytes, enwik3
Output:
879 bytes, mtcc
837 bytes, zlite
833 bytes, kwc
712 bytes, sharc 1
463 bytes, rar zip max
457 bytes, cmm 43
431 bytes, bcm
399 bytes, packet
391 bytes, crush x
355 bytes, ccm 7
354 bytes, ccmx 7
313 bytes, ppmx
286 bytes, paq8pxd7 8
BANANAS happens to have one segment that usefully repeats, but for words that small, usually almost all the compression is order 0, basically just like plain Huffman would do. The BWTS would be unlikely to do any harm by itself, but it would depend on the second step transform, etc.
Of course, since the BWTS of BANANAS is BNNSAAA, it works out nicely.
But if all seven letters are different then there is no harm.
Then again, if the letters are already grouped, a BWTS will usually make it worse.
For really small files: if the file is random, it's best not to compress at all.
The more you know about the file, the better you can make a compressor for it.
For example, if it's byte data and random, just use an order-zero bijective arithmetic coder, not a Huffman, if
the goal is to make the average the smallest.
Take any IID file where the symbols can be any byte. For a group of random files, the average
of the bijective zero-order arithmetic will be smaller than that of the bijective Huffman compressor
as the number of random files increases.
People get confused about the above since, for any arbitrary file of the type above, sometimes the
arithmetic compresses better and sometimes the Huffman, but in general the arithmetic wins.
A so-called plain Huffman, where you have to store a table in front, would not be a good compressor
for such small files because it's not possible to make it bijective. There is, though, a hybrid type of
Huffman with a header of sorts, which contains the symbols used and where they first occur, that can
be made bijective, but the second part of the file is still adaptive.
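To put a number on the order-zero idea, here is a small sketch of my own (not anyone's actual coder): it only estimates, with a Laplace add-one model over the 256 byte values, how many bits an adaptive order-0 arithmetic coder would spend, and it ignores the bijective file-ending treatment.
Code:
import java.nio.charset.StandardCharsets;

public class Order0Estimate {
    // Bits an adaptive order-0 arithmetic coder with an add-one model
    // would spend on the data (cost estimate only, no actual coding).
    static double order0Bits(byte[] data) {
        int[] count = new int[256];
        int total = 0;
        double bits = 0;
        for (byte b : data) {
            int c = b & 0xFF;
            double p = (count[c] + 1.0) / (total + 256.0);
            bits += -Math.log(p) / Math.log(2);
            count[c]++;
            total++;
        }
        return bits;
    }

    public static void main(String[] args) {
        byte[] text = "BANANAS".getBytes(StandardCharsets.US_ASCII);
        double bits = order0Bits(text);
        System.out.printf("%.1f bits (about %.1f bytes)%n", bits, bits / 8);
    }
}
For "BANANAS" this estimate comes out around 52.5 bits, roughly 6.6 bytes, so at least in the same ballpark as the under-7-bytes figure mentioned earlier.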
Also, fully adaptive Huffman coding does not compress as well (or as fast) as arithmetic coding using the same model.
I agree and have coded fully bijective adaptive Huffman coders. And yes, I prefer them over plain Huffman coders.
But even though they on average do better than plain Huffman coders, the bijective adaptive arithmetic coder beats
the adaptive bijective Huffman on average. I have implemented an adaptive Huffman coder using the code of
an arithmetic coder.
However, there is more than one way to update the tree (see http://www.ics.uci.edu/~dan/pubs/DC-Sec4.html)
and there are several ways to do the adaptive arithmetic.
The best general adaptive Huffman, when it gets a file of 256 bytes all different, writes
the first byte as 8 bits. Then the next 255 bytes each carry a single bit that tells whether the symbol is
a new character or not. The same 263 bits of overhead used by the adaptive Huffman can
also be used in a simple arithmetic coder, where you have a weight of 50/50 for
the case of seeing a new character during compression. Of course most reasonable
arithmetic coders would assign less than a bit to a new character if many occur in a row, but
this is for the comparison of the arithmetic coder versus an adaptive Huffman.
Next, the bits for the worst case adaptive Huffman when assigning codes to new characters:
127 at 8 bits, 64 at 7 bits, 32 at 6 bits, 16 at 5 bits, 8 at 4 bits, 4 at 3 bits, 2 at 2 bits, 1 at 1 bit, 1 at 0 bits
263 + 127*8 + 64*7 + 32*6 + 16*5 + 8*4 + 4*3 + 2*2 + 1*1 + 1*0 = 2048 bits, or 256 bytes; with bijective ending, 256 bytes.
Now look at the best case adaptive Huffman:
128 at 7 bits, 64 at 6 bits, 32 at 5 bits, 16 at 4 bits, 8 at 3 bits, 4 at 2 bits, 2 at 1 bit, 1 at 0 bits
263 + 128*7 + 64*6 + 32*5 + 16*4 + 8*3 + 4*2 + 2*1 + 1*0 = 1801 bits, or 225.125 bytes; with bijective ending, 226 bytes.
The simple adaptive arithmetic:
263 + log(255!)/log(2) = 1938.9962872242... bits, equal to 242.374535903... bytes, for a bijective file length of 242 or 243 bytes depending on the order of the data.
Note that in all 3 cases you are left with each symbol having a weight of 1/256, i.e. 8 bits, once you have processed the 256 bytes of data.
Note that the average of the arithmetic over the set of all files where each symbol is used once is smaller than the average of the compressed set using the adaptive Huffman. The reason is the simple counting argument: the arithmetic is by default optimal for the average. Also note that
the average of the smallest possible Huffman length and the longest Huffman length is smaller than what the arithmetic does. However, when weighting all files, the Huffman compresses more files to a longer length than the arithmetic.
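For anyone who wants to re-check those totals, a trivial sketch that just redoes the arithmetic:
Code:
public class AdaptiveTotals {
    public static void main(String[] args) {
        // 263 bits = 8 bits for the first byte plus 1 new-character flag bit
        // for each of the remaining 255 bytes
        int overhead = 8 + 255;

        // Worst case adaptive Huffman code lengths for the new characters
        int worst = overhead + 127*8 + 64*7 + 32*6 + 16*5 + 8*4 + 4*3 + 2*2 + 1*1 + 1*0;

        // Best case adaptive Huffman code lengths
        int best = overhead + 128*7 + 64*6 + 32*5 + 16*4 + 8*3 + 4*2 + 2*1 + 1*0;

        // Simple adaptive arithmetic: overhead + log2(255!)
        double arith = overhead;
        for (int i = 2; i <= 255; i++) arith += Math.log(i) / Math.log(2);

        System.out.println("worst case Huffman : " + worst + " bits = " + worst / 8.0 + " bytes");
        System.out.println("best case Huffman  : " + best + " bits = " + best / 8.0 + " bytes");
        System.out.printf("adaptive arithmetic: %.4f bits = %.4f bytes%n", arith, arith / 8);
    }
}
It prints 2048, 1801 and about 1938.996 bits, matching the three cases above.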
I wonder how adaptive Huffman stacks up to splay tree compression. This was mentioned in a CS class I took (the CS department was big on splay trees... this may have had something to do with Danny Sleator being on the faculty). I did a search on it recently, and I managed to turn up a really old paper and some crufty C code that wasn't easy to play with.
The difference between Huffman and arithmetic coding is that Huffman codes cannot have fractional bit lengths. The worst case is when all characters in the input are the same and therefore almost perfectly predictable. Huffman coding codes to 1 bit per byte. Arithmetic coding codes to near 0 bits per byte.
If you are doing arithmetic coding optimally, it should never do worse than Huffman. In the case where the optimal bit lengths turn out to always be whole numbers, arithmetic coding should effectively degrade into Huffman. Otherwise, it should take advantage of any opportunity to do better.
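A rough way to see the size of that gap (a sketch of my own, assuming a simple add-one estimate of the "same symbol again" probability rather than any particular coder): an adaptive model spends roughly log2(n+1) bits in total on a run of n identical symbols, while Huffman cannot go below 1 bit per symbol.
Code:
public class RunOfIdenticals {
    public static void main(String[] args) {
        int n = 1000; // hypothetical run length of identical bytes
        double arithBits = 0;
        for (int i = 0; i < n; i++) {
            // After i identical symbols, add-one estimate of "same again" = (i+1)/(i+2)
            arithBits += -Math.log((i + 1.0) / (i + 2.0)) / Math.log(2);
        }
        System.out.printf("adaptive arithmetic: %.2f bits total (about log2(n+1))%n", arithBits);
        System.out.println("Huffman: " + n + " bits total (at least 1 bit per symbol)");
    }
}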
No, this is a common misunderstanding. With "optimal adaptive arithmetic coding" you do model the probabilities of each future symbol more accurately than with "optimal adaptive Huffman", based on PAST data, but the next symbol
may not be what you're hoping for.
Example: suppose there are only 3 symbol types in a file and you have processed all 3 types, so there are no more unknown symbol types.
Say 5 A's have occurred,
4 B's have occurred,
and 3 C's have occurred.
The optimal Huffman will assign the next codes like this:
A 0
B 10
C 11
so that it writes 1 bit for A and 2 bits for B or C,
but the optimal arithmetic writes for A, B, C as follows:
-log(5/12)/log(2) = 1.26303440583... bits for A
-log(4/12)/log(2) = 1.58496250072... bits for B
-log(3/12)/log(2) = 2 bits for C
In this case I am assuming more symbols follow, so no file-ending treatment is needed yet.
If symbol A occurs next, the file will be shorter for the optimal Huffman by 0.26303440583... bits.
If symbol B occurs next, the file will be shorter for the optimal arithmetic by 0.41503749927... bits.
If symbol C occurs next it makes no difference, since both produce the same length.
If the real probabilities of the file match the arithmetic more closely, it will create the shorter file.
Only at rare places in the file will they ever have the same possible outputs for the next symbol.
And even at those places, once the data is processed, the updated tables for the very next
symbol will NEVER match each other.
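The same next-symbol costs, redone in a few lines (a sketch that just evaluates the ideal code lengths -log2(count/total) against the whole-bit Huffman lengths):
Code:
public class NextSymbolCost {
    public static void main(String[] args) {
        String[] sym = {"A", "B", "C"};
        int[] count = {5, 4, 3};       // occurrences seen so far
        int[] huffLen = {1, 2, 2};     // optimal Huffman code lengths for these counts
        int total = 5 + 4 + 3;
        for (int i = 0; i < sym.length; i++) {
            double arithLen = -Math.log((double) count[i] / total) / Math.log(2);
            System.out.printf("%s: Huffman %d bits, arithmetic %.11f bits, difference %+.11f%n",
                    sym[i], huffLen[i], arithLen, arithLen - huffLen[i]);
        }
    }
}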
Well, a bijective arithmetic coder can't always be better than or equal to a bijective Huffman coder, because that would defy the counting theorem.
Let's suppose we have a compressor that treats the input as bijectively Huffman encoded, decodes it, and outputs it as bijectively arithmetic encoded. If it compresses in some case, then it must expand in some other case.
Anything bijective must be exactly optimal for some distribution. You could discover the distribution by feeding the decoder files starting from the smallest and seeing what comes out.
Huffman can only make coarse-grained changes, though, so it doesn't do very well on real data.
There must be a trade-off somewhere that benefits Huffman. Since Huffman can't make any symbol use <1 bit, that also means that it can't over-adapt to a string of identical symbols and incur a large cost to stop. But that's really not a problem of arithmetic coding, it's how the probabilities are adjusted.
Very good point. But when looking at all cases of compression of a file of a given length, the bijective arithmetic will also compress to an average that is smaller in length than the bijective Huffman.
Here is an example based on a subset: take any file of the length of concern and all its permutations. The
reasoning that follows holds for all subsets like this, for files of all lengths. The bijective arithmetic compressor compresses
all permutations of a file to either n+1 or n bits, while the Huffman would have a wider span of bit lengths, and that makes
its average longer.
To help see this, suppose you have a tree for N bits; suppose N is 8, then you have 256 cases, each mapping to 8 bits. The tree
is a valid tree and the average is 8 bits per symbol. Now make the least possible change so that the tree has some codes 9 bits long:
in this case 2 symbols would be 9 bits long, 1 symbol 7 bits long, and 253 symbols 8 bits long, and the average length of the
compressed stream just increased to 8.00390625. This is why, when compressing purely random data, nothing can beat a flat
copy of one file to the other for best average. The hope of a compression program is that files of use are not random.
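Checking that 8.00390625 figure (a one-line computation wrapped up so it runs):
Code:
public class TreeAverage {
    public static void main(String[] args) {
        // 2 codes of 9 bits, 1 code of 7 bits, 253 codes of 8 bits,
        // over 256 equally likely symbols
        double avg = (2 * 9 + 1 * 7 + 253 * 8) / 256.0;
        System.out.println(avg + " bits per symbol"); // prints 8.00390625
    }
}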
Actually I think the places where a Huffman would be of more use than an arithmetic would be extremely rare, since Matt has stated
that Huffman compression is slower.
As an example from geometry: of the two, which is the best shape for a corral, a rectangular shape or a square shape? Well, the
answer has to be the rectangular, since the square is a subset of the rectangular.
It's the same thing with compression: it has to be the arithmetic, since Huffman is a subset of arithmetic. Maybe trying to
force a nonoptimal solution, by trying to find the closest poor fit of a Huffman, takes more time and work; I don't know.
I am sure many would argue that there are many such places, since Huffman is easier for most to grasp. Yet it is hard for me to
think of a place where arithmetic is not better or the same, since Huffman is just a subset of arithmetic.
It's a little bit of an apples to oranges comparison, because Huffman, at least in its original form, computes an exact solution based on a given set of symbol frequencies, while arithmetic coding is just a technique for encoding arbitrary symbols within arbitrary ranges into one big number. Huffman sort of views symbols as separate messages that obey some known distribution, and it optimally solves that problem. Arithmetic coding is more like taking a number with a mixed radix and converting it to binary.
Static Huffman is blazingly fast, but the tree is not so efficient to update incrementally. Since adaptive Huffman isn't really solving the right problem, anyway, there's not much point in using it.
Static Huffman is also a 2 pass process. The solution that static Huffman computes is not the best. It is only conditionally optimal because it forces the use of an unrealistic constraint. That unrealistic constraint is to force, at every point in the file, the use of a WHOLE number of bits for the compression. If you remove this unnecessary constraint you can get an optimal one pass arithmetic coder.
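To make the cost of that whole-number-of-bits constraint concrete, a small sketch with a made-up skewed three-symbol source: Huffman cannot assign less than 1 bit to the dominant symbol, while an arithmetic coder with the same static model can approach the entropy.
Code:
public class WholeBitCost {
    public static void main(String[] args) {
        double[] p = {0.90, 0.05, 0.05};   // made-up skewed symbol probabilities
        int[] huffLen = {1, 2, 2};         // optimal Huffman code lengths for them
        double entropy = 0, huffAvg = 0;
        for (int i = 0; i < p.length; i++) {
            entropy += -p[i] * Math.log(p[i]) / Math.log(2);
            huffAvg += p[i] * huffLen[i];
        }
        System.out.printf("entropy (what arithmetic can approach): %.3f bits/symbol%n", entropy);
        System.out.printf("Huffman average length                : %.3f bits/symbol%n", huffAvg);
    }
}
With these probabilities the entropy is about 0.569 bits/symbol while Huffman averages 1.1 bits/symbol.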