This is a coder that attempts to make use of the tiny redundancies actually left behind
by some compression algorithms, especially bitcode-based ones.
The idea is rather simple - there are some (combinatorial) coding methods
that always compress any data (not counting the space required to store the stats).
For example, a 1024-bit block with a 0s count of 412 and a 1s count of 612 yields a redundancy of 33 bits,
which can be used to encode the counts and still save some space.
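For reference, this figure is easy to check with a few lines of code (illustration only, not anything from cdm):

    // Check the 1024-bit example: ~33 bits of redundancy, ~10 of which pay for the count.
    #include <cmath>
    #include <cstdio>

    // log2 of the binomial coefficient C(n,k)
    double log2_binomial( int n, int k ) {
      return ( std::lgamma(n+1.0) - std::lgamma(k+1.0) - std::lgamma(n-k+1.0) ) / std::log(2.0);
    }

    int main() {
      int n = 1024, c0 = 412;
      double enum_bits  = log2_binomial( n, c0 );  // ~990 bits to index one of C(1024,412) combinations
      double count_bits = std::log2( n+1.0 );      // ~10 bits to store c0 itself (it can be 0..1024)
      // ~990 + ~10 = ~1000 bits total vs 1024 raw, so about 24 bits are still saved
      printf( "%.1f + %.1f = %.1f bits vs %d raw\n", enum_bits, count_bits, enum_bits+count_bits, n );
    }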
So cdm runs a parsing optimization algorithm that tries to split the data
into a series of small blocks, each of which can be either stored or encoded.
The method is asymmetric, so decoding is much faster than encoding.
Encoding can also be made faster with weaker optimizer settings.
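Schematically, such an optimizer can look like this (a simplified sketch, NOT cdm's actual algorithm - the flag/header costs and block limit here are arbitrary assumptions): dynamic programming over block boundaries, where each block is either stored raw or coded enumeratively:

    #include <vector>
    #include <cmath>
    #include <cstdint>
    #include <algorithm>

    static double log2_binomial( int n, int k ) {
      return ( std::lgamma(n+1.0) - std::lgamma(k+1.0) - std::lgamma(n-k+1.0) ) / std::log(2.0);
    }

    // bits[] = input as individual bits; returns the minimal total cost in bits
    double optimize_parsing( const std::vector<uint8_t>& bits, int MAXBLK=2048, double HDR=12 ) {
      size_t N = bits.size();
      std::vector<int> ones( N+1, 0 );              // prefix sums of 1s, for O(1) per-block counts
      for( size_t i=0; i<N; i++ ) ones[i+1] = ones[i] + bits[i];
      std::vector<double> best( N+1, 1e18 );        // best[i] = cheapest way to code the first i bits
      best[0] = 0;
      for( size_t i=0; i<N; i++ )
        for( int len=1; len<=MAXBLK && i+len<=N; len++ ) {
          int k = ones[i+len] - ones[i];
          double stored  = 1 + len;                                // flag + raw bits
          double encoded = 1 + HDR + log2_binomial( len, k );      // flag + count + enum index
          best[i+len] = std::min( best[i+len], best[i] + std::min(stored,encoded) );
        }
      return best[N];
    }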
Unfortunately I wasn't able to compress any paq archives with this version,
but results for any (unknown) bitcoded formats seem to be interesting.
Also, it seems that cdm can be used for file analysis/format detection -
here we can clearly see that jpeg uses bit7-0 bit order, while deflate
uses bit0-7.
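One simple way to reproduce this kind of test is to reverse the bit order within each byte and compare cdm's results on the original vs. the reversed file. A trivial helper for that (illustration only, not part of cdm):

    #include <cstdint>

    // reverse the bit order within one byte (bit7..0 <-> bit0..7)
    uint8_t reverse_bits( uint8_t x ) {
      x = uint8_t( (x>>4) | (x<<4) );               // swap nibbles
      x = uint8_t( ((x>>2)&0x33) | ((x<<2)&0xCC) ); // swap bit pairs
      x = uint8_t( ((x>>1)&0x55) | ((x<<1)&0xAA) ); // swap adjacent bits
      return x;
    }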
- saved 1-2 bytes per file on EOF coding
- removed price cache which reduced encoder memory usage from 1G to 10M
- support for blklen!=maxblk for uncompressed blocks
- 3 freq0 ranges instead of 2
- speed optimizations
LZ4.exe is added to the demo; use LZ4 -12 first if the input file is compressible,
because CDM doesn't include LZ compression.
> The idea is rather simple - there are some (combinatorial) coding methods
> that always compress any data (not counting the space required to store the stats).
Sounds like what is usually referred to as "enumerative coding" - it uses the recurrence:
binomial(n,k) = binomial(n-1,k) + binomial(n-1,k-1)
which is quite intuitive: to put k ones in n positions, the first position gets either a 0 or a 1, leaving k or k-1 ones for the remaining n-1 positions - giving the above recurrence.
So to enumerate (n,k) combinations, we can use [0..binomial(n-1,k)-1] to enumerate those starting with 0, and the remaining values for those starting with 1.
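As a minimal illustration of that enumeration (a sketch only - 64-bit binomials, so only small block sizes; not a practical coder):

    #include <vector>
    #include <cstdint>

    // Pascal's triangle: C[i][j] = binomial(i,j), via C(n,k) = C(n-1,k) + C(n-1,k-1)
    std::vector<std::vector<uint64_t>> binomials( int n ) {
      std::vector<std::vector<uint64_t>> C( n+1, std::vector<uint64_t>(n+1,0) );
      for( int i=0; i<=n; i++ ) {
        C[i][0] = 1;
        for( int j=1; j<=i; j++ ) C[i][j] = C[i-1][j-1] + C[i-1][j];
      }
      return C;
    }

    // rank of an n-bit block with exactly k ones, in [0..C(n,k)-1];
    // blocks starting with 0 get the lower part of the range, as described above
    uint64_t rank_block( const std::vector<uint8_t>& bits, int k,
                         const std::vector<std::vector<uint64_t>>& C ) {
      uint64_t r = 0;
      int n = int( bits.size() );
      for( int i=0; i<n; i++ )
        if( bits[i] ) { r += C[n-1-i][k]; k--; }    // skip all combinations that have a 0 in this position
      return r;
    }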
While for standard entropy coding it is sufficient to know an approximate probability p ~ k/n, paying for the inaccuracy with increased size (the Kullback-Leibler divergence in bits/symbol), enumerative coding requires knowing the exact number of ones (k).
It is great if we indeed need to encode/enumerate combinations, or to generate a random combination.
It seems costly, as it requires knowing and operating on binomials, which are large numbers ... but I remember seeing a paper claiming that it can be quite fast (compared to AC).
A huge drawback is that it is usually used only for a binary alphabet and a single static probability, which is insufficient for real data compressors ...
I see it could be generalized to a larger alphabet, e.g. ternary:
binomial(n|k,l) = binomial(n-1|k-1,l) + binomial(n-1|k,l-1) + binomial(n-1|k,l)
but it seems even more costly, and a varying probability is difficult ... it could be realized e.g. by interleaving a few fixed-probability coders.
A similar concept to enumerative coding is used in Fischer's PVQ (Pyramid Vector Quantizer: http://ieeexplore.ieee.org/document/...number=1057198 ): encoding points of the l1 sphere sum_i |x_i| = k (again using a similar recurrence), which is the heart of Daala's PVQ (Perceptual Vector Quantizer).
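For illustration, the number of such points follows the same flavour of recurrence; a quick sketch of the count (my own illustration, not Daala code; 64-bit values limit n,k to modest sizes):

    #include <vector>
    #include <cstdint>

    // number of integer points with sum_i |x_i| = k in n dimensions:
    // N(n,k) = N(n-1,k) + N(n,k-1) + N(n-1,k-1)
    uint64_t pvq_count( int n, int k ) {
      std::vector<std::vector<uint64_t>> N( n+1, std::vector<uint64_t>(k+1,0) );
      for( int i=0; i<=n; i++ ) N[i][0] = 1;        // only the zero vector has l1 norm 0
      for( int i=1; i<=n; i++ )
        for( int j=1; j<=k; j++ )
          N[i][j] = N[i-1][j] + N[i][j-1] + N[i-1][j-1];
      return N[n][k];
    }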
> Sounds like what is usually referred to as "enumerative coding"
Yes, but it's implemented with a normal (although precise) rangecoder in this case.
The difference from the usual adaptive arithmetic coding is that freq0 is known in advance
(it is encoded before the data block), so instead of starting with freq0=1, freq1=1 and incrementing
them, I'm starting with freq1=blklen-freq0 and decrementing.
In this case, the size of the code generated by rc is very close to log2( (freq0+freq1)!/freq0!/freq1! ).
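Roughly, in code the decrementing scheme looks like this (just a sketch of the idea - the actual coder feeds the probability into the precise rangecoder instead of summing ideal codelengths):

    #include <vector>
    #include <cstdint>
    #include <cmath>

    // freq0 is transmitted first; then for each bit the exact remaining counts
    // give the probability, and the matching counter is decremented.
    double block_codelen_decrementing( const std::vector<uint8_t>& bits, int freq0 ) {
      int freq1 = int( bits.size() ) - freq0;
      double len = 0;
      for( uint8_t b : bits ) {
        double p = ( b ? freq1 : freq0 ) / double( freq0+freq1 );
        len -= std::log2( p );                      // ideal cost of this bit
        if( b ) freq1--; else freq0--;
      }
      return len;  // equals log2( (freq0+freq1)!/freq0!/freq1! ) for the initial counts
    }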
Btw, an adaptive model would have codelen = log2( (freq0+freq1+1)!/freq0!/freq1! ),
which is the same as encoding the freq0 value with probability = 1/(blklen+1)
(it can take values in the range 0..blklen) on top of the decrementing scheme.
But then, we can do better there and use some model to encode the freq values.
Which is important, because with compressed data we would have a pretty good
idea about the freq0 value, rather than assuming it to be purely random.
> It seems costly, as it requires knowing and operating on binomials, which are large numbers...
> but I remember seeing a paper claiming that it can be quite fast (compared to AC).
I made a fixed-point log2(n) table, then a log2(n!) table using it.
Then, log2( C(c0+c1,c0) ) = LUT[c0+c1]-LUT[c0]-LUT[c1].
Here I'm using this to estimate the codelengths for the parsing optimizer.
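Something like this (a reconstruction of the idea; the 16-bit fixed-point scale is an assumption, and log2(n!) is accumulated in floating point here for brevity):

    #include <vector>
    #include <cstdint>
    #include <cmath>

    struct CodeLenEstimator {
      enum { SCALE = 1<<16 };                       // fixed-point precision (assumption)
      std::vector<uint64_t> LUT;                    // LUT[n] ~ log2(n!) * SCALE
      CodeLenEstimator( int maxn ) : LUT( maxn+1, 0 ) {
        double acc = 0;
        for( int n=1; n<=maxn; n++ ) {
          acc += std::log2( double(n) );
          LUT[n] = uint64_t( acc*SCALE + 0.5 );
        }
      }
      // estimated codelength of a (c0,c1) block, in fixed-point bits:
      // log2( C(c0+c1,c0) ) ~ LUT[c0+c1] - LUT[c0] - LUT[c1]
      uint64_t block_bits( int c0, int c1 ) const { return LUT[c0+c1] - LUT[c0] - LUT[c1]; }
    };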
But a very long time ago I also made an actual enumerative encoder -
it precalculated binomials up to a fixed block size using Pascal's triangle.
Sure, it was multiplication-free and pretty fast, but there was no
compression gain compared to a rangecoder, and all the symbol decomposition
and blockwise coding were complicated, and the complete coder wasn't
faster than rc either.
As to further cdm improvements, I have the following ideas:
- Use adaptive linear counters for flags etc (cdm currently uses static probabilities).
- Alternative model with bit[i]^bit[i-k] transform before encoding
- Alternative model with byte value bitmask for a block
- Simple LZ4-like LZ with its own parsing optimization
Originally I wanted to make a byte model and compress a frequency table of byte values,
but it would require supporting much longer blocks, and the parsing optimization
is already pretty slow even with maxblk=256.
Some of my tests of cdm are in the attached Excel file. There's also a JPG file with the same table.
My insights are as follows:
1) This algorithm is very powerful for low-redundancy files (as you wrote) - for such files it even has some gains over 7zip with maximal compression - in my testbed these were E.TIF (internally compressed), F.JPG and J.EXE (internally packed).
2) But some other files are also compressed quite well by the pure cdm1 algorithm, e.g.:
==== 0.WAV - this audio file is compressed only 7.9% worse than 7zip, with a compression ratio of about 60%
==== G.EXE - this executable is also close to 7zip - only 8.7% worse, with a compression ratio of about 75%
3) For the 0.WAV file, pure cdm1 compression is about 25% better than lz4!
4) For text files, pure cdm1 compression is only about 6.6% worse than lz4.
Here is the file with the v3 test. I've added a comparison to the lz4 -12 scores.
About v2:
Pure cdm v2 is about 1.1% better than lz4 -12 over my whole testbed.
Adding cdm on top of files preprocessed with lz4 -12 adds about 7% to the lz4 scores.
Combining the best scores from pure cdm and the preprocessed variant gives a 9.9% better score than lz4.
About v3:
Generally, for pure cdm scores, v3 gains about 5% over v2.
Pure cdm v3 is about 6.0% better than lz4 -12 over my entire testbed!
Adding cdm on top of files preprocessed with lz4 -12 adds only about 2.6% to the lz4 scores. Due to the better cdm compression, lz4 preprocessing becomes less important.
Combining the best scores from pure cdm and the preprocessed variant gives a 10.7% better score than lz4.
It's also possible to compress CDM_test.7z from 32,455 to 32,438.
But overall it's not quite what I wanted. There are better algorithms for wavs and images, etc.
What about actual compressed formats? .aac? videos? .rar? .cab/.wim? image raws? .flif? LZO? various exe packers etc?
Well, flif doesn't seem compressible, but otherwise here's an example:
1. Original size of calgarytest.zpaq is 1,003,792
2. enwik6a is the first 2,097,152 (=2M) bytes of enwik
3. enwik6a_mct.rar is compressed with ppmd vH -o16 -m256 (rar -mc16:256t+)
> But overall it's not quite what I wanted. There are better algorithms for wavs and images, etc.
Yes, I know. Good wav compression is only a side effect, but it's quite interesting because it's quite effective...
>What about actual compressed formats? .aac? videos? .rar? .cab/.wim? image raws? .flif? LZO? various exe packers etc?
I'll test some other files. Generally, after CM compressors there are no improvements at all - I've tested files after CMIX12 and some CMV files. I'll check more files too.
Only cdm. cdm1 was necessary because p_maxblk was static, and coders optimized for more redundant data (eg. jpg, ogg) and for less redundant data (.zip)
produced significantly different results.
But now that I made p_maxblk adaptive, it doesn't matter that much anymore.
> Good wav compression is only a side effect, but it's quite interesting because it's quite effective...
It's actually BWT compression. You can probably get better results with normal bwt postcoders, rather than cdm.
> Generally, after CM compressors there are no improvements at all -
With some faster ones, like ccm, there could be. E.g.
> I've tested files after CMIX12 and some CMV files. I'll check more files too.
Well, maybe you can check more redundant files? Like paq'ed bmps or something?
But overall I don't have many ideas for cdm after arithmetic coding, if bytewise BWTS doesn't help.
I did try a permutation model (each byte value appears only once in a block) - on such a block of length 256
we can save 45.5 bytes, for example.
And it does help a little, but cdm's parsing optimizer apparently only uses such blocks as a replacement
for uncompressed ones, and even then very rarely - like 4-5 times per file.
And of course it also slows down the encoding, so I didn't keep it.
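For reference, the 45.5 figure follows directly from log2(256!): a permutation of all 256 distinct byte values needs log2(256!) ≈ 1684 bits ≈ 210.5 bytes instead of 256. A one-liner to check it (illustration only):

    #include <cmath>
    #include <cstdio>

    int main() {
      double bytes = std::lgamma( 257.0 ) / std::log( 2.0 ) / 8;   // log2(256!) in bytes, ~210.5
      printf( "permutation block: %.1f bytes vs 256 raw -> saves %.1f bytes\n", bytes, 256-bytes );
    }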
I've tested some audio and image files and some other formats with CDM v5.
I used two approaches:
1. I took 6 songs and 6 images, converted them into a few different formats, and then compressed them. Some insights:
------ AAC and MP3 formats got a 1.5-2.2% gain over 7zip compression (-mx9) - on average 1.9% and 1.7% respectively;
------ the FLIF format got the same compression ratio as 7zip (-mx9), which means -1.4% - flif is completely incompressible;
------ TIFF (internally LZW-compressed), JPG and PNG got the biggest gains over 7zip among the image formats - 2.5%, 3.2% and 4.3% respectively;
------ the RAW format is an exception because of its high compression ratio - about 54% - for this format CDM has a much worse score than 7zip, but there is one strange case - the RAW5 image (lots of stars on a black sky) - for this image CDM compression is only 1.2% worse than 7zip despite the almost 50% compression ratio of this file;
2. I took two files, the 1st audio song and the 3rd image, converted them to multiple formats, and then compressed them. Some insights for these two files:
------ the WAV and AIFF raw formats are compressed quite well by CDM - only about 5-6% worse than 7zip. This is similar to CDM's good wav compression score on my testbed - 6.7% worse than 7zip;
------ the WMA format is compressed about 9% better than 7zip - compression ratio = 21% vs. 13% for 7zip;
------ other audio formats gain a little from CDM - on average 1% better than 7zip;
------ raw or near-raw formats (BMP, EMF, ICO, PPM, RAW, TGA) with huge compression ratios are compressed much worse by CDM than by 7zip;
------ the majority of formats have a small gain compared to 7zip, similar to the audio formats (about 0.9%) - GIF, JLS, JP2, good quality JPG, JPM, images in PDF, PGM, TIFF, WEBP...
------ low quality JPG (probably due to its small size) is compressed quite well - 12% better than 7zip - compression ratio 63%;
and some more interesting cases:
------ the PBM file is also compressed 11.5% better than 7zip - compression ratio 93%;
------ the PCX file has a 5% gain over 7zip with a compression ratio = 57%;
------ Reymont.pdf - the text PDF file from the Silesia Corpus - a 12% better score than 7zip with quite a high compression ratio = 83%;
------ the three examples above mean that for some kinds of data or file formats CDM has a very competitive compression ratio.
I've attached an Excel file with all the scores, and two JPGs with tables of the figures for both approaches.
I have a question - what is the allowed range of block sizes? The default is 5M, but what are the limits?
Thanks, this is more interesting, in particular I completely forgot about .wma and .pcx. Also .flac.
Can you check other lossless audio codecs too?
However,
1. You can try compressing redundant files with lzma2, then cdm. Note that lzma/plzma don't work - only lzma2 would use "stored" blocks for "incompressible" data.
2. Alternatively, srep or lz4 can be used - in some cases LZ4 can be better, because lzma2 blocks are not very small (~64k).
3. It's also better to check files with precomp - for example, raw image formats frequently contain large embedded jpegs,
and we already have better (than cdm) solutions for deflate and jpeg recompression.