For acceptance of file formats, it can be good to make your code as easily adoptable as possible. The last thing you want is for someone to reimplement your format in their own slightly buggy or incompatible code simply because they're trying to avoid any license issues. Rightly or wrongly, this makes GPL and even LGPL a bad move in those cases. Yes, it's perfectly fine for companies to link in an LGPL library, but far too many companies have idiots dictating "thou shalt not use GPL" out of fear rather than understanding. That's annoying, as there is very little you can do to combat idiocy.
The flip side is that it's nice for improvements to your code to be forced into the open, but it depends on what you're trying to achieve: whether it is to harness a larger workforce and get changes from others, or to achieve maximum adoption / user base. Hence no one license is best.
nemequ@hoplite:~/local/src/lz4x/src$ git:(master) g++ -o lz4x lz4x.cpp
lz4x.cpp:33:23: fatal error: sys/utime.h: No such file or directory
#include <sys/utime.h>
^
compilation terminated.
The fix for this (using _WIN32 instead of depending on a non-standard NO_UTIME macro) was included in my pull request. AFAICT that is a Windows-specific header (POSIX has a <utime.h>; no idea if it is compatible with Windows' <sys/utime.h>).
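A minimal sketch of the idea (not the exact lz4x.cpp patch): key the Windows-only header off the compiler-provided _WIN32 macro instead of a hand-defined NO_UTIME.

    // guard the Windows-only header with the compiler-defined _WIN32 macro
    #ifdef _WIN32
    #  include <sys/utime.h>   // Windows CRT variant
    #else
    #  include <utime.h>       // POSIX equivalent
    #endif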
A few weeks ago I started my own LZ4 engine, called smalLZ4, which is based on optimal parsing, too.
There is a quick'n'dirty website with full C++/C source code: http://create.stephan-brumme.com/smallz4/
Code comes with a simple command-line interface and works fine with G++, Clang++ and Visual C++.
The compressor is a single C++ file (smallz4.cpp) without any external dependencies, while the decompressor is actually plain C code (a single file, too: smallz4cat.c).
My resulting files are a bit smaller than those produced by encode's LZ4X.
Right now my program needs about 6 minutes to compress enwik9, which is about half as fast as LZ4X.
Performance is my main issue and I still have to run quite a few tests to make sure my program always emits valid LZ4.
As soon as things have stabilized, I will add a Git repo on my server.
PS: I have no intention to ever break LZ4 compatibility (like using larger windows) - I just want to create perfect LZ4 files.
LZ4X compresses to the legacy LZ4 frame = fixed 8 MB independent blocks. smallz4 compresses to the newer frame format, which allows solid (dependent) blocks and may improve compression a little bit.
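For readers unfamiliar with it, the legacy frame mentioned above is just a magic number followed by independently compressed 8 MB blocks, each prefixed with its 4-byte little-endian compressed size. Below is a minimal sketch of a writer for that layout; compress_block() is a hypothetical helper standing in for a raw LZ4 block compressor, not part of either project.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    static void write_u32_le(std::FILE* f, uint32_t v)
    {
        // store a 32-bit value in little-endian byte order
        unsigned char b[4] = { (unsigned char)(v),
                               (unsigned char)(v >> 8),
                               (unsigned char)(v >> 16),
                               (unsigned char)(v >> 24) };
        std::fwrite(b, 1, 4, f);
    }

    // hypothetical helper: compresses one independent block to raw LZ4 block data
    std::vector<unsigned char> compress_block(const unsigned char* src, std::size_t n);

    void write_legacy_frame(std::FILE* out, const unsigned char* src, std::size_t size)
    {
        const std::size_t kBlock = 8 * 1024 * 1024;   // fixed 8 MB input blocks
        write_u32_le(out, 0x184C2102u);               // legacy frame magic number
        for (std::size_t off = 0; off < size; off += kBlock)
        {
            const std::size_t n = (size - off < kBlock) ? size - off : kBlock;
            const std::vector<unsigned char> blk = compress_block(src + off, n);
            write_u32_le(out, (uint32_t)blk.size());  // compressed size prefix
            std::fwrite(blk.data(), 1, blk.size(), out);
        }
    }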
The main reason is simplicity. The LZ4Demo(Legacy) format was pretty close to what I did with ULZ/CRUSH/LZSS/LZPM. So I just adapted some of my work to be LZ4-compatible!
Updated to LZ4X v1.15 - improved compiler compatibility (tested with Cygwin), replaced rewind() with fseek(f, 0, SEEK_SET), since with some compilers rewind() does not work properly with large files.
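For reference, the replacement is a one-liner; a tiny sketch (reset_to_start is a hypothetical helper name, not from LZ4X):

    #include <cstdio>

    void reset_to_start(std::FILE* f)
    {
        // rewind(f) is equivalent to (void)fseek(f, 0L, SEEK_SET) except that it
        // also clears the error indicator; calling fseek() directly additionally
        // lets the caller check the return value.
        if (std::fseek(f, 0L, SEEK_SET) != 0)
            std::perror("fseek");
    }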
Compiled binaries really have no business in a repository… perhaps instead of adding the zip to the git repository you could create a release on GitHub? Putting compiled binaries in the repo basically makes everyone using the git repository download the binary for every version you ever create, and people typically use git because they want the source code, not a pre-compiled binary.
Less importantly, it would also be helpful if you made one commit per change (with a descriptive commit message, not "Updated to v1.xy"); then when you want to make a release, create a tag (which will create a release on GitHub, and you can attach binaries to the release if you wish).
Hi, I'm a long-time lurker and finally decided to try some things. Thanks for sharing LZ4X and smallz4. I played with smallz4, trying to optimize it a bit further (in optimal mode only); I don't think it can get as fast as LZ4X in that case because of the matcher, which takes > 75% of the run time. I did manage to speed it up slightly, by about 10% as measured on enwik8. There are other opportunities, but in -9 mode, it's all about the speed of findLongestMatch and how often it gets called. As always with optimization, YMMV - I used VS 2015 and its profiler to squeeze things a bit.
Only two changes are immediate wins - drop the default constructor for the Match() class, and try this match finding:
bool match4(const unsigned char * ptr, const int off) const
{
    const auto a = reinterpret_cast<const uint32_t*>(ptr);
    const auto b = reinterpret_cast<const uint32_t*>(ptr + off);
    return *a == *b;
}

bool back_check_match(
    const unsigned char * current,
    int length,
    const int off) const
{
    length -= 4;
    while (length > 0)
    {
        if (match4(current + length, off))
            length -= 4;
        else
            return false;
    }
    // no need to handle remaining bytes, we knew the first 4 were OK anyway
    return true;
}
/// find longest match of data[pos] between data[begin] and data[end], use match chain stored in previous
Match findLongestMatch(const unsigned char * data, const uint32_t pos, const uint32_t begin, const uint32_t end, const Distance * previous)
{
    Match result; // no longer default-initialized to anything
    result.length = 1;

    // compression level: look only at the first n entries of the match chain
    int32_t stepsLeft = maxChainLength;

    // pointer to position that is matched against everything in data
    const auto current = data + pos - begin;
    const auto end_p   = current + end - pos;

    // get distance to previous match, abort if -1 => not existing
    Distance distance = previous[pos % PreviousSize];
    int neg_offset = 0;

    while (distance != NoPrevious)
    {
        // too far back ?
        neg_offset -= distance;
        if (neg_offset < -MaxDistance)
            return result;

        // stop searching on lower compression levels
        if (--stepsLeft < 0)
            return result;

        // prepare next position
        distance = previous[(pos + neg_offset) % PreviousSize];

        // go backwards to confirm beginning of match; we also know the
        // first 4 bytes are a match so we can go 4 bytes at a time
        // and stop early
        if (!back_check_match(current, result.length, neg_offset))
            continue;

        // now go forwards to extend match
        const Length match_len =
            find_match_end(current + result.length, end_p, neg_offset) - current;

        // match longer than before ?
        if (match_len > result.length)
        {
            result.distance = -neg_offset;
            result.length   = match_len;
        }
    }
    return result;
}
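The snippet above calls find_match_end() without showing it. Here is a minimal sketch of what such a helper might look like - an assumption on my part, not the poster's actual code: extend the match forward byte by byte and return a pointer just past the last matching byte, so the caller can subtract `current` to obtain the total match length.

    static const unsigned char* find_match_end(const unsigned char* p,
                                               const unsigned char* end_p,
                                               const int neg_offset)
    {
        // neg_offset is negative, so p + neg_offset points into the earlier
        // (candidate) copy of the data
        while (p < end_p && *p == *(p + neg_offset))
            ++p;
        return p;
    }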
Interesting thing. LZ4X is nearly the first LZ compressor I'm not really able to use for pragmatic/practical purposes, despite being a cool thing.
Here is the story:
It exhibits some funny properties:
- On levels 1 to 8 it compresses quite fast, BUT somehow lz4hc (as seen in lzbench's LZ4 version) beats it (in terms of ratio) on all files I've given it.
- On level 9 LZ4X wins, sure. BUT I've run into really fancy trouble on the way.
The problem with level 9 is:
* Give it a 1 MiB file - works, beats LZ4HC, it's cool!
* Give it a 1.1 MiB file: still works, compression time is small enough, sure.
* 1.2 MiB file. Works, but now it takes around 10 seconds to get there. The trend looks a bit suspicious, no?
* 1.25 MiB file. Still works, but now it's like 35 seconds on my HW. A mere 50 KiB for a 3x slowdown? Seems I've entered a steep curve.
* 1.3 MiB file... ok, I can wait and... after roughly ~7 minutes it's done, sure. But hey, another 50 KiB crashed the speed by 15x?
* If I grab the "full" ~2 MiB version of the file, I'm not patient enough to wait for compression to complete.
So it seems that when the file is about 1.2-1.3 MiB I'm hitting a really sharp, exponential slowdown. Is this part of the plan?
All the mentioned nearly-1 MiB files are the very same file (a texture in DDS, fairly compressible - by ~3x in LZ4X's case), successively truncated to figure out at which point it would work for me; I did a kind of "binary partitioning" of the file size to get a rough idea. On this particular file and hardware it seems I can barely do 1.3 MiB of that file, and I'm not exactly sure how long it would take to compress the whole 2 MiB thing; the trend does not look promising at all.
Another interesting observation is that GCC 6.1 beats GCC 5.4 into the dust on this particular compressor, speeding compression up by about 2x or so (it's a rare sight to get a 2x boost from a mere compiler upgrade, lol). But given the overall sharp exponential-style trend, that barely lets you get, say, 50 KiB extra. The whole 2 MiB thing did not complete compression even in a few hours, so I gave up.
I've used GCC 5.4 and then 6.1 on x86-64 Linux (with the later version proving to be considerably faster on this particular algo). I usually use -O3, but I gave -O0 and -O2 a shot; it kills speed, but the overall trend stays the very same - the sharp exponential spike grossly outweighs anything else.
The Linux kernel (a ~20 MiB highly redundant chunk of code + data) did not finish compression even when I let it run overnight. So my "standard" test of trying to compress a certain Linux kernel binary (it makes a fancy crash-test dummy and could also bring some practical/pragmatic benefits if the crUsh test goes well) has... failed. Seems in this case the guys running the test die of old age and the crash-test dummy gets bored and walks away.
I still wonder: is this sharp exponential growth of compression time really expected and desirable?
1. Optimal parsing engines have O(n^2) worst-case behavior. It's usually fixed by a "fast bytes" optimization (see the sketch after this list); probably LZ4X lacks it. The BWT-based lz4a is the only optimal-parsing compressor that can overcome this limit.
2. I'm pretty sure the behavior you've seen was due to the 8 MB L3 cache size. Optimal parsers usually use 8x the input data size to store the search tree.
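A self-contained sketch of the "fast bytes" idea: once a candidate match reaches a chosen threshold, accept it and stop walking the match chain, which bounds the work spent inside long runs of identical bytes. All names here (find_match_fast_bytes, fastBytes, maxChainLength) are illustrative assumptions, not code from LZ4X or smallz4.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Match { uint32_t length = 0; uint32_t distance = 0; };

    // chain: earlier positions in data (all < pos) that share a hash with data[pos]
    Match find_match_fast_bytes(const std::vector<unsigned char>& data,
                                std::size_t pos,
                                const std::vector<std::size_t>& chain,
                                uint32_t fastBytes,        // early-accept threshold
                                uint32_t maxChainLength)   // hard search-depth limit
    {
        Match best;
        uint32_t steps = 0;
        for (std::size_t prev : chain)
        {
            if (++steps > maxChainLength)   // depth limit ("practical" mode)
                break;
            // measure the match length at this candidate position
            uint32_t len = 0;
            while (pos + len < data.size() && data[prev + len] == data[pos + len])
                ++len;
            if (len > best.length)
            {
                best.length   = len;
                best.distance = static_cast<uint32_t>(pos - prev);
            }
            if (best.length >= fastBytes)   // "fast bytes": good enough, stop searching
                break;
        }
        return best;
    }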
I think I've mostly solved this fancy puzzle. The mentioned file had a large chunk of zeros (0x00 bytes) at the end, starting roughly at the 1.2 MiB boundary I tracked down. The Linux kernel also contains several large areas full of zeros in its uncompressed image. It seems that's what provokes the corner case, and that's what I would call the WORST case :P. So it seems LZ4X (and lz4a) go nuts when facing a large chunk of zeros. As simple as that.
Yep, that is correct!
With the next versions of LZ4X I'll keep -9 as a Brute level - as currently - but will change -8 to "practical" optimal parsing by adding a search depth limit. Currently, -9 has no search limit of any kind - that's why we're facing some weird corner cases....
Sounds reasonable; I could even suggest several different levels doing this, with different limits (and resulting speeds). I would agree it could be interesting to see the best one can get within a particular bitstream format, but it seems sometimes one may really want it to be slightly suboptimal. Btw, the LZ5 v2 stream format looks quite interesting; it targets even better ratios without being slow to decompress.
Yeah, it's not very hard to guess after taking a look at the data in a hex editor. Somehow I'd forgotten about this part and got really puzzled by this behavior. Though it seems I'm good at picking redundant files full of zeros.
Long runs of identical bytes (e.g. lots of zeros) aren't handled well by LZ4X. My implementation smallz4 (see the discussion at http://encode.su/threads/2593-smallz4 or grab the source code at http://create.stephan-brumme.com/smallz4/) contains a small tweak: if the previous byte matched "itself" (that means the match distance is one), then my program assumes that this match is optimal for the current byte, too, and skips searching for a better match (see lines 505-517).
In addition, my implementation supports match finding across block boundaries and therefore usually achieves a better compression ratio than LZ4X.
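A hedged sketch of that distance-one shortcut (illustrative names, not the actual smallz4 code from lines 505-517): if the previous position already matched at distance 1, i.e. we are inside a run of identical bytes, reuse that match for the current position instead of searching again.

    #include <cstdint>

    struct Match { uint32_t length = 0; uint32_t distance = 0; };

    // hypothetical wrapper around a full match finder; fullSearch stands in for
    // something like findLongestMatch()
    Match match_with_run_shortcut(const unsigned char* data, uint32_t pos,
                                  const Match& previousMatch,
                                  Match (*fullSearch)(const unsigned char*, uint32_t))
    {
        // the previous byte matched "itself" (distance 1), so we are inside a run
        // of identical bytes: assume the same match, one byte shorter, is optimal
        // for the current byte too and skip the expensive chain search
        if (previousMatch.distance == 1 && previousMatch.length > 1)
        {
            Match m;
            m.distance = 1;
            m.length   = previousMatch.length - 1;
            return m;
        }
        return fullSearch(data, pos);   // normal (slow) path
    }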
So it completes compression in a foreseeable time, indeed! Also, smallz4 level 9 beats lz4hc for sure. Btw, thanks for the nice description of optimal parsing on your site.