function compress(char* source, char* dest, int source_size)
{
    fill hash_table[4096] with 0xFFFFFFFF;           // "empty cell" marker
    anchor = source;
    // Main loop
    while (remaining source_size > 3)
    {
        hash = calcHash(*(U32*)source);
        if (hash_table[hash] == 0xFFFFFFFF               // empty cell
            || source - hash_table[hash] > 65535         // too far for a 16-bit offset
            || *(U32*)source != *(U32*)hash_table[hash]) // 4 bytes differ
        {
            hash_table[hash] = source++;
            continue;
        }
        match_num = 4;
        // Extend the match both to the left and to the right of source;
        // match_num = total number of matched bytes.
        // Encode llllmmmm oooooooo oooooooo [l]...[l] [literals] [m]...[m]:
        number_of_literals = source - anchor;
        offset = source - hash_table[hash];
    }
}
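The sequence-emitting step could be sketched like this (a minimal sketch of the llllmmmm oooooooo oooooooo layout above, assuming a minmatch of 4 and with the [l]/[m] length-extension bytes omitted for brevity; all names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Emit one sequence: a token byte with 4-bit literal-count and
   match-length fields, a 16-bit little-endian offset, then the
   literal bytes themselves.  Counts >= 15 would need extension
   bytes, which this sketch leaves out. */
static uint8_t *emit_sequence(uint8_t *dst, const uint8_t *literals,
                              size_t lit_num, size_t match_num,
                              uint16_t offset)
{
    *dst++ = (uint8_t)((lit_num << 4) | (match_num - 4)); /* minmatch = 4 */
    *dst++ = (uint8_t)(offset & 0xFF);                    /* offset, LE */
    *dst++ = (uint8_t)(offset >> 8);
    memcpy(dst, literals, lit_num);                       /* raw literals */
    return dst + lit_num;
}
```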
lz4 at level 1 compresses enwik8 to ~53%. But if I run my code, which follows the algorithm above, it shows a ratio of 56.9%.
My code only reaches ~53% if the hash_table has 32768 cells. What am I doing wrong?
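For reference, an LZ4-style multiplicative hash might look like the sketch below; the table size is set by HASH_LOG, so 12 bits gives the 4096 cells above while 15 bits gives 32768 (calcHash's exact form is an assumption here, not the poster's code):

```c
#include <stdint.h>
#include <string.h>

#define HASH_LOG 12   /* 1 << 12 = 4096 cells; larger logs spread
                         collisions out and usually improve ratio */

/* Multiply by a Knuth-style constant and keep the top HASH_LOG bits. */
static uint32_t calcHash(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);                 /* unaligned-safe 4-byte read */
    return (v * 2654435761u) >> (32 - HASH_LOG);
}
```

With only 4096 cells, many distinct 4-byte groups collide and evict each other, which is a plausible reason a 12-bit table loses a few percent of ratio versus a 15-bit one.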
Compression ratio in lz77 isn't just about finding the longest match; it's about representing the most data with the fewest bytes. You can compute this as a float, 'complen/len': the bigger the value, the better the match. With this you can add thresholds for minimum ratios, which is what I use in my lz77 code; typically a ratio of 1.35 or higher allows for better indirect parsing, since the threshold prevents bad matches from being encoded.

A look-ahead should also be taken into consideration: it's usually better to look forward at least 3 to 4 bytes before committing, so the algorithm can consider a different match. These two changes will give better compression without changing the size of the hash table, though they might slow it down a little.
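The ratio-threshold idea could be sketched as follows (MIN_RATIO and the cost accounting are illustrative assumptions, not the poster's actual code):

```c
#include <stddef.h>

#define MIN_RATIO 1.35   /* reject matches that barely pay for themselves */

/* Accept a match only if the bytes it covers are sufficiently larger
   than the bytes the encoder would spend on it (token + offset +
   any length-extension bytes). */
static int accept_match(size_t match_len, size_t cost_bytes)
{
    if (cost_bytes == 0) return 0;
    return (double)match_len / (double)cost_bytes >= MIN_RATIO;
}
```

A rejected match is simply emitted as literals, leaving the encoder free to find a better match a few bytes later.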
Thanks, but at this time I'm interested in compression "on the fly", or at least "on the run". I'm going to rewrite my code in asm, so I want to know whether the algorithm contains a small bug that worsens the compression ratio...
With a naive search I can do without a hash table at all.
By the way: what ratio and time does your lz77 code show on enwik8 compared with lz5 level 1?
Oof, that's a comparison I'm not looking forward to. lz5 is a beast, but I typically use mine to detect structures and help BWT compress better, so they are quite different in terms of their innards.
Here are the closest compression-time-to-ratio results I could get:
enwik8 to 42,568,137 in 8.7 sec. [lizard v1.0.0 -25]
enwik8 to 42,558,854 in 10.5 sec. [Jam 0.71 (unreleased, lz77 codec only)]
And for the fastest modes:
enwik8 to 52,988,350 in 0.4 sec. [lizard v1.0.0 -20]
enwik8 to 47,249,341 in 3 sec. [Jam 0.71 (unreleased, lz77 codec only)]
Lizard is a monster at decode speed (+800MB/s) whereas mine decodes at 300MB/s. There are so many trade-offs with lz77 that it's becoming harder and harder to beat some of the current implementations. I guess the right implementation really just depends on your use-case. For extremely fast compression (which is what you're looking for), the density library should be a good place for ideas on how to go faster: https://github.com/centaurean/density.
Another trick you could borrow is how lizard splits its output into multiple streams for decoding; this lets multiple pointers decode data (literal, match, and offset streams) out of order and achieve a decent speed boost.
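The multi-stream idea might be sketched like this (a toy layout with illustrative names, not Lizard's actual format): tokens, literals, and offsets live in separate buffers, so the decoder advances three independent pointers and the CPU can overlap the loads instead of serially parsing one interleaved stream.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    const uint8_t  *tokens;    /* llllmmmm token bytes   */
    const uint8_t  *literals;  /* raw literal bytes      */
    const uint16_t *offsets;   /* one offset per match   */
} streams_t;

/* Decode one sequence: copy the literals, then report the match
   (length and offset) for the caller to apply.  Returns the number
   of literal bytes written to dst. */
static size_t decode_step(streams_t *s, uint8_t *dst,
                          size_t *match_len, size_t *match_off)
{
    uint8_t tok  = *s->tokens++;
    size_t  lits = tok >> 4;
    for (size_t i = 0; i < lits; i++) dst[i] = *s->literals++;
    *match_len = (tok & 0x0F) + 4;        /* minmatch = 4 */
    *match_off = *s->offsets++;
    return lits;
}
```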
Thanks a lot. Some questions:
1. Does Jam 0.71 use something more complicated than a simple hash to get ratios of 47% and 42%?
2. Suppose I develop code that at the fast level compresses data 10% better than lz5/lizard at the same speed. Do I have any chance of selling my algorithm to developers of commercial archivers, backup software, etc.?
3. Do developers benefit from publications like github.com/centaurean/density?
4. Where can I download the sources (Delphi, FPK, C) of an archiver that is free for commercial use, so I can simply replace its compression/decompression functions with my own functions in asm?
1) Jam 0.71 uses a hash chain with a 2MB hash. Hash chains are links from a hash function that mark positions of hash collisions; they require up to 4n space plus the hash, but this can be reduced to 4n/k with a smaller window. Match finding varies from 1-byte to 4-byte look-ahead, with a heuristic for searching the chain: if we don't find a match within the first 8 comparisons, we move on; but if we have found at least one good match within those first 8 comparisons, we keep searching deeper, up to 32 positions. The indirect parsing increases compression by about 3%, which translates into roughly a 1MB gain on enwik8. The format is lz4-style token packing, but with LEB128 with carry of lower byte ranges; this theoretically allows distances up to 32GB, but realistically it's limited to 4GB. That's pretty much all there is to it.
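A hash-chain match finder with the 8-then-32 probe heuristic described above might look roughly like this (all names, sizes, and constants are illustrative assumptions, not Jam's actual code):

```c
#include <stdint.h>
#include <string.h>

enum { HASH_BITS = 15, NIL = -1 };

static int32_t head[1 << HASH_BITS];   /* hash -> most recent position */
static int32_t chain[1 << 20];         /* position -> previous position
                                          with the same hash (the chain) */

static uint32_t hash4(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, 4);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

static int match_len(const uint8_t *a, const uint8_t *b, const uint8_t *end)
{
    int n = 0;
    while (a + n < end && a[n] == b[n]) n++;
    return n;
}

/* Walk the chain: give up after 8 probes, but once a match of at
   least 4 bytes is found, allow up to 32 probes to find a longer one. */
static int find_match(const uint8_t *buf, int pos, const uint8_t *end,
                      int *best_pos)
{
    int best = 0, probes = 8, depth = 0;
    int32_t p = head[hash4(buf + pos)];
    while (p != NIL && depth < probes) {
        int len = match_len(buf + p, buf + pos, end);
        if (len >= 4 && len > best) {
            best = len;
            *best_pos = p;
            probes = 32;               /* found one: search deeper */
        }
        p = chain[p];
        depth++;
    }
    return best;
}

/* Link the current position into the chain for its hash bucket. */
static void insert_pos(const uint8_t *buf, int pos)
{
    uint32_t h = hash4(buf + pos);
    chain[pos] = head[h];
    head[h] = pos;
}
```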
2) Supposing you do make the fastest lz77 encoder for a given ratio, it may be attractive if its license is fairly permissive and unobtrusive about how customers use it. I wouldn't recommend adding clauses that could be exploited to obtain rights to other people's work (*cough* https://encode.su/threads/2780-ZSTD-license). If you distribute your code as a dll, you should be able to release it so that your code stays completely separate from any customer's compiled project; this is usually a safe bet for commercial applications. As for sales, it really depends on how you market it; personally I'd let the work speak for itself rather than trying to push hard. It's fine to write articles or blog posts comparing it to other solutions, but being biased is never good for your image, especially if you intend to make money off of it.
3) The benefit of devs publishing their work is that it gains the author recognition and can inspire other programmers to improve their own code.
4) I'm not too sure. If you're trying to replace a closed-source solution's decoder with a speedier asm version, that might break its license if the authors don't want people reverse-engineering it. You could possibly write your own decoder for something like winrar, but the license says you need written permission from winrar's creators to modify their decoder, and it would only apply to non-commercial use. As for sources, it depends what you're looking to modify. No one can be stopped from reverse-engineering code, but you can release code as open source under such a restrictive license that no one can use it commercially; https://github.com/powzix/kraken/ is a great example of this. I guess it's really down to you whether you'd want to write it.
lz77, you can develop your compressor as a plugin for freearc or 7-zip. The CLS API supported by freearc is extremely simple to implement, so I suggest you start with it.
Thanks, but what happens if my code turns out so good (for example, 10% faster and with a 10% better ratio than Lizard) that it could be used by companies like Yandex, Acronis, E. Roshal?.. Could I make some money on this? Or is it useless to dream about it?
By the way: why do members of this forum not have inventions/patents like Schindler's transform, or publications on arxiv.org?
> if my code will be so good (for example, 10%
> faster, and ratio 10% better than Lizard) that
You can add your codec to some open-source projects (basically ones which use LZ4),
linux utilities, etc. If it's actually good, people might start using it.
> it can be used by companies like Yandex, Aсronis, E. Roshal?..
Winrar is unlikely (they have always used their own codecs),
but you might get a job offer from some big company.
Don't expect anybody to just pay you for already-existing sources, though.
It's just that if your codec is good, you'd obviously be the best person to maintain it.
> Could I make some money on this? Or it's useless to dream about it?
It may be possible to sell a codec to some game studio - modern games are
basically self-extracting archives of various resources,
so using some compression is necessary, and backward compatibility doesn't matter much.
> By the way: why members of this forum do not have inventions/patents like
> Shindlers transform, and publications on arxiv.org?
Do you expect patents assigned to encode.su nicknames? :)
Well, some actually do have them.
Just keep in mind that you'd spend something like $15k (and a lot of work) to get a US patent
for a codec remotely (non-US patents basically don't matter unless you work for a government).
And then the only use for it would be showing your name in patents.google.com to people.
Well, or if you want to be a patent troll.
As for publications, they're only needed when you work at a university.
It's the best way if you only have a theory, a complexity analysis of some algorithm, etc.
For a programmer, it's usually best to show off a working program -
in fact, any detailed algorithm description could be bad, because somebody else
might patent it, etc. - https://encode.su/threads/2648-Publi...t-by-Storeleap
> Here's the closest compression times to ratio I could get:
> enwik8 to 42,568,137 in 8.7 sec. [lizard v1.0.0 -25]
> enwik8 to 42,558,854 in 10.5 sec. [Jam 0.71 (unreleased, lz77 codec only)]
He-he, see the results of my old (not yet speed-optimized) code in comparison with lizard v1.0.0:
Kernel Time = 0.328 = 00:00:00.328 = 7%
User Time = 3.671 = 00:00:03.671 = 83%
Process Time = 4.000 = 00:00:04.000 = 90%
Global Time = 4.406 = 00:00:04.406 = 100%
>timer mycode.exe enwik8.flz
Timer 3.01 Copyright (c) 2002-2003 Igor Pavlov 2003-07-10
Kernel Time = 0.375 = 00:00:00.375 = 16%
User Time = 0.812 = 00:00:00.812 = 35%
Process Time = 1.187 = 00:00:01.187 = 52%
Global Time = 2.281 = 00:00:02.281 = 100%
===============================
Timer 3.01 Copyright (c) 2002-2003 Igor Pavlov 2003-07-10
using blocks of size 262144 KB
Compressed 100000000 bytes into 42164371 bytes ==> 42.16%
Kernel Time = 0.437 = 00:00:00.437 = 1%
User Time = 25.968 = 00:00:25.968 = 96%
Process Time = 26.406 = 00:00:26.406 = 98%
Global Time = 26.844 = 00:00:26.844 = 100%
Kernel Time = 0.593 = 00:00:00.593 = 8%
User Time = 0.500 = 00:00:00.500 = 7%
Process Time = 1.093 = 00:00:01.093 = 15%
Global Time = 7.000 = 00:00:07.000 = 100%
========================================
My code compresses enwik8 about 7 times faster than lizard...
After I rewrite my code in asm it should run ~40% faster still.