Hello everyone, here is my CSC 3.2 Alpha 2.
Here is CSC on Compression Ratings.
It's a complete rewrite of CSC3.1, aiming at a better ratio and faster speed.
The main algorithm is almost the same in both: LZ77 + Ari.
But to achieve better performance, I made these changes:
1. Abandoned the HashChain match finder and used a hash table instead. The number of match candidates is only 1-6 (the hidden parameter -mx uses 12). That means in -m1 there is only one hash-6-bytes candidate for each string match.
2. On most data, LZ is very good, but on bad data like audio it is very slow. So there it only looks for matches at the beginning of small blocks and inserts hash-6 strings every several bytes. It then performs like a skip (30MB/s+) but can still find duplicate blocks. (Try a tar containing several copies of the same zipped file.)
3. Detection: CSC3.1 calculated the entropy of each column (1/2/3/4/8/16) every 64KB to determine whether delta is needed. That was slow. Now this code is used instead:
while (progress < size - 12)
{
    sameDist[0] += abs((signed)src[progress] - (signed)src[progress + 1]);
    sameDist[1] += abs((signed)src[progress] - (signed)src[progress + 2]);
    sameDist[2] += abs((signed)src[progress] - (signed)src[progress + 3]);
    sameDist[3] += abs((signed)src[progress] - (signed)src[progress + 4]);
    sameDist[4] += abs((signed)src[progress] - (signed)src[progress + 8]);
    sameDist[5] += abs((signed)src[progress] - (signed)src[progress + 12]);
    progress++;
}
The MinBlockSize is 8KB. If the largest and smallest sameDist[i] differ by a lot, that often indicates the block is a table, and the distance with the smallest sameDist[i] is the channel count.
4. Improved parsing and model. There are 5 kinds of packets in the LZ output stream:
normal match, match with a repeat distance (4), 1-byte match (last distance; helps a lot on data tables), same match (last length and distance), and literal.
Literals use order-1 (3 MSBs as context). Match distance coding uses the match length as context.
5. Range coder changed to the same one as LZMA. It seems 10-20% faster on decompression compared to CSC3.1's coder.
6. The dictionary size is much larger now: 320MB (actually 500MB+, but that seems meaningless). Due to the changes in the match finder, memory usage is only 40MB + dictionary size in -m1 mode. Decompression needs 5MB + dictionary size, I/O buffers included.
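As a rough illustration of item 1 (a hash table with a few candidates per bucket instead of a hash chain), here is a minimal sketch in C. It is not CSC's actual code: the names (MatchFinder, mf_init, mf_insert, mf_find), the 16-bit table size, the multiplicative hash and the newest-first replacement policy are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_BITS 16   /* assumed table size: 2^16 buckets */
#define MAX_CAND  6    /* upper bound on candidates per bucket */
#define MIN_MATCH 6    /* the hash covers 6 bytes */

typedef struct {
    uint32_t slot[1u << HASH_BITS][MAX_CAND]; /* position + 1; 0 means empty */
    int      cand;                            /* candidates in use: 1..MAX_CAND */
} MatchFinder;

static void mf_init(MatchFinder *mf, int cand)
{
    assert(cand >= 1 && cand <= MAX_CAND);
    memset(mf->slot, 0, sizeof mf->slot);
    mf->cand = cand;
}

/* Hash the 6 bytes at p down to HASH_BITS bits (hash function is an assumption). */
static uint32_t hash6(const uint8_t *p)
{
    uint64_t v = 0;
    memcpy(&v, p, 6);
    return (uint32_t)((v * 0x9E3779B97F4A7C15ull) >> (64 - HASH_BITS));
}

/* Record position pos: older entries shift down, the newest goes first.
   Caller guarantees that 6 bytes are readable at buf + pos. */
static void mf_insert(MatchFinder *mf, const uint8_t *buf, size_t pos)
{
    uint32_t *b = mf->slot[hash6(buf + pos)];
    memmove(b + 1, b, (size_t)(mf->cand - 1) * sizeof *b);
    b[0] = (uint32_t)pos + 1;
}

/* Check the bucket's candidates against pos. Returns the longest match
   length, or 0 if none reaches MIN_MATCH; stores its distance in *dist.
   Only positions before pos may have been inserted. */
static size_t mf_find(const MatchFinder *mf, const uint8_t *buf, size_t size,
                      size_t pos, size_t *dist)
{
    const uint32_t *b = mf->slot[hash6(buf + pos)];
    size_t best = 0;
    for (int i = 0; i < mf->cand && b[i] != 0; i++) {
        size_t cand = b[i] - 1, len = 0;
        while (pos + len < size && buf[cand + len] == buf[pos + len])
            len++;
        if (len >= MIN_MATCH && len > best) {
            best = len;
            *dist = pos - cand;
        }
    }
    return best;
}
```

With cand set to 1, each lookup checks a single earlier position, so the search cost per byte stays fixed no matter how large the dictionary is.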
Still needs improvement:
LZ parsing has bugs so far. Its price evaluation doesn't work. Check E.coli: theoretically it should have no matches and be compressed to 2bpb.
The delta coding is far from satisfactory. Thanks to Nania, but his work distinguishes data by the file header. So what should I do if I don't know the width of a bitmap? Sami advises that a linear predictor + order 0 is good enough, but I lack knowledge of linear predictors. Changing CSC into an archiver is also a kind of solution.
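For readers unfamiliar with the term, here is a minimal sketch in C of one common form of linear prediction: each byte is predicted from the two previous samples of the same channel, and only the prediction error is stored, which an order-0 coder can then model. This is not what Sami proposed in detail and not CSC's code. The function names, the fixed coefficients (2 and -1) and the assumption that the stride is already known (for example from a detector like the one in item 3) are all choices made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace each byte by its prediction error, in place. Walks backwards so
   the predictor always sees original samples. The first 2 * stride bytes
   are left unchanged. */
static void lp_encode(uint8_t *buf, size_t size, size_t stride)
{
    assert(stride > 0);
    for (size_t i = size; i-- > 2 * stride; ) {
        /* Extrapolate a straight line through the two previous samples. */
        uint8_t pred = (uint8_t)(2 * buf[i - stride] - buf[i - 2 * stride]);
        buf[i] = (uint8_t)(buf[i] - pred);
    }
}

/* Inverse of lp_encode. Walks forwards so the samples used by the
   predictor have already been restored. */
static void lp_decode(uint8_t *buf, size_t size, size_t stride)
{
    assert(stride > 0);
    for (size_t i = 2 * stride; i < size; i++) {
        uint8_t pred = (uint8_t)(2 * buf[i - stride] - buf[i - 2 * stride]);
        buf[i] = (uint8_t)(buf[i] + pred);
    }
}
```

On data where each channel changes at a steady rate, the errors are all zero after the first two samples of each channel. Plain delta (subtracting only the previous sample) would leave the constant slope in the output instead.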
Technical changes in CSC3.2 Alpha 3:
1. Removed repeat match in LZ. Ratio improved very slightly on most files.
2. Changed literal coding a little, which led to a very small ratio improvement on all files.
3. Changed delta coding. Improved decoding speed.
4. Cached some bits in the match finder, which improved compression speed.
Technical changes in CSC3.2 Alpha 4:
1. Improved compression speed on data that is hard to compress.
2. Fixed a bug that hurt compression.
3. Reset modes: -m0 -> -m1, -m1 -> -m2, -m2/3 -> -m3. The default is -m2 now.
4. Dictionary size is more flexible now; it can be any number of KB from 32KB to 512MB.
Technical changes in CSC3.2 Alpha 5:
1. Improved compression speed a little.
2. Added a very weak and experimental DICT preprocessor (not Bulat's). Improved ratio by 1% on text.
Technical changes in CSC3.2 Alpha 6:
1. Very small improvements, mostly in DICT.
2. This is the last version of CSC. The next version will be an archiver, maybe called CSArc.
Technical changes in CSC3.2 Beta 2:
The new version puts more emphasis on compression ratio than on compression efficiency.
1. Rewrote the LZ part for -m2/-m3 mode. Removed the previous -m1 mode; now -m1 is the previous -m2.
The LZ (-m2/3) now uses better parsing, but is much slower than before.
The literal coder is now complete order-1, while the previous one used only the 3 MSBs.
2. New delta detector, but it does not seem better than before.
3. It's now an archiver; much of the code is from YZX (or from Sami Runsas).
4. Some other small modifications.
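The literal-context change in item 1 (top 3 bits of the previous byte versus the whole previous byte) can be pictured with the small C sketch below. It only shows how a context might be chosen and how it would index a probability table. The function names and the 256-node bit-tree layout per context (as used by LZMA-style bitwise literal coders) are assumptions for the example, not taken from CSC's source.

```c
#include <assert.h>
#include <stdint.h>

/* Older scheme: only the top 3 bits of the previous byte select the
   model, giving 8 contexts. */
static unsigned ctx_3msb(uint8_t prev) { return prev >> 5; }

/* Newer scheme: the whole previous byte is the context, giving 256. */
static unsigned ctx_full(uint8_t prev) { return prev; }

/* Index of one adaptive bit probability inside a flat table. Each context
   is assumed to own 256 nodes of a binary tree over the literal's bits,
   with the root at node 1. */
static unsigned prob_index(unsigned ctx, unsigned node)
{
    assert(node >= 1 && node < 256);
    return ctx * 256 + node;
}
```

Under this layout the table grows from 8 * 256 to 256 * 256 probabilities. With 3 MSBs, all lowercase ASCII letters share one context. The full order-1 context separates them, which usually helps ratio on text at the cost of a larger model.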
Changes in CSC3.2 Final:
Fixed a decompression error in the previous version.