I've not seen any references about this new compressor on this forum, so thought I'd post some links about it here to generate discussion:
http://acube.di.unipi.it/bc-zip/
http://zola.di.unipi.it/rossano/wp-c...pdf/soda14.pdf
That sounds impressive:
The performance of our BC-ZIP is extremely good even when compared with leading high-performance data compressors. On one hand, it generates parsings with decompression times on par with those of LZ4 (slightly slower on Wikipedia, faster on Census) but with significantly improved compression ratios. On the other hand, its compression ratios at the higher compression levels are close to those of the best compressor, namely LZMA2, but with an order of magnitude faster decompression speeds.
That research group seems to be the best and one of the few doing basic theoretical work on LZ these days.
It's too bad they don't report compression time and don't publish code so we can test for ourselves.
One of their tests is for 1 GB of Wikipedia text, but this is not enwik9 as far as I can tell. The sizes they give for other compressors like snappy, LZ4, LZMA, zlib, bzip2 are different from the LTCB results.
There is source code for a lot of algorithms that Giovanni Manzini co-authored on Manzini's web page. Manzini was not an author on this one. This group mentions that the code is in C++11. They've probably over-engineered the classes and templates and that's what's holding it up. C++ is an invitation to expend your efforts on everything besides creating working code.
The description from the paper (Weight-Constrained Shortest Path over the LZ77 DAG) makes me suspect this is around as slow as zopfli (<1MB/s).
The paper claims the Wikipedia text is from http://download.wikimedia.org/enwiki...ticles.xml.bz2 , but that's symlinked to whatever the latest dump was, which isn't very useful.
I just noticed that they've updated the above link with some more info, charts, and a link to the source code on github:
https://github.com/farruggia/bc-zip
I am a bit disappointed. Decompression is very fast and compression ratio is good, but the website fails to mention that compression is slow and requires huge memory. I was able to compress enwik8 in Ubuntu but enwik9 throws std::bad_alloc (out of memory):
bc-zip bit-optimal enwik8 enwik8.lzo -e soda09_16 -b 16 --> 29,900,824, 73s, 0.4s, 2715 MB memory
The bit-optimal command means use best compression. There is a compress command that lets you specify bounds on either size or decompression speed, but it requires a target file specifying machine characteristics. It has a tool to generate the file, but that tool needs another file generated by LMBench3. I got an error when I tried to install it, so I gave up.
The -e option specifies an encoder. I tried both of the recommended ones. The soda09_16 encoder gets better compression so I used it for most tests. The hybrid-16 encoder decompresses faster.
The -b option specifies a bucket size in MB. I am testing on a machine with 4 GB memory. I found that options from -b 20 to -b 50 allocate 4.6 GB memory, and using the default requires 5.6 GB. It will still compress but is considerably slower with a lot of disk thrashing. enwik9 ran out of memory even with -b 1. Here are some results (size, real compression time, real decompression time, compression peak virtual memory):
default --> 27,528,734, 597s, 0.4s, 5.6 GB
-b 50 --> 28,387,413, 264s, 0.4s, 4.2 GB
-b 32 --> 28,911,321, 187s, 0.4s, 4.6 GB
-b 20 --> 29,619,793, 182s, 0.4s, 4.6 GB
-b 16 --> 29,900,824, 73s, 0.4s, 2.7 GB
-b 16 -e hybrid-16 --> 34,845,060, 72s, 0.26s, 3.0 GB
Memory usage is the peak virtual memory observed by watching the output of top. It tends to fluctuate around 3-4 GB most of the time. The program uses 1 thread.
The test machine is a 2.67 GHz M620, 2+2 hyperthreads, 4 GB, 64 bit Ubuntu, compiled with g++ 4.8.2 and boost_1_55_0.
These two statements seem incompatible. The STL is intended exactly to prevent "expending your efforts on everything besides working code," whereas "writing their data structures by hand" will do exactly that. It seems to me that they used C++ for exactly the right reasons. They were clearly focused on demonstrating good compression with high decoding speeds. If they chose the STL so as to invest only the required amount of energy in getting the encoding side solid enough, and thereby stay focused on their actual goal (decoding performance), then they used the STL as it was intended to be used.
I gave a cursory look at the code and I didn't see any over-engineered classes and templates. They did use a lot of unwarranted typedefs and too much inheritance rather than composition (in my opinion), but that's a design choice and not an example of 'over-engineering'.
C++ is often the best tool available when placed in the hands of an experienced engineer. In the hands of a hobbyist or academic, perhaps not so much.
Expending your time writing basic data structures and algorithms by hand rather than using the STL (unless it's really required) is the best way to ensure that you will spend far more energy later on debugging, fixing maintenance issues, and condemning future developers to wade through it each time they need to do any work in that neighborhood. That is the best way to make sure that your efforts go to everything besides creating working code.
Hello,
I'm Andrea Farruggia, one of the authors of the bc-zip paper(s) and the main tool developer.
A few words about performance on the compression side.
Although in our latest ESA paper (preprint available on the bc-zip website) we made important steps forward in lowering compression time (the new "cache-friendly" FSG algorithm), due to lack of time we didn't put much emphasis on memory consumption, which is admittedly high (~50 GiB for compressing a 1 GiB file without bucketing, as mentioned in our ESA paper).
There is an ongoing effort to improve this situation, as we are evaluating different approaches to close the compression-time gap with gzip while retaining our decompression time and compression ratio. For instance, one easy strategy to lower the working memory in compression would be to employ a variable-length integer representation in some of the compressor's data structures.
In fact, in many places we use full 32/64-bit integers even when we know that those integers are going to be much smaller. Representing them in a more succinct way would save a lot of memory.
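To give an idea of what we mean, a standard LEB128-style varint (small values take one byte instead of four or eight) would already go in this direction. The sketch below is only a generic illustration of that technique, not the actual bc-zip code:

#include <cstdint>
#include <vector>

// Generic LEB128-style varint: values below 128 take one byte instead of 4 or 8.
// Illustrative only; not the representation used inside bc-zip.
void put_varint(std::vector<std::uint8_t> &out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(v) | 0x80);  // low 7 bits + continuation flag
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

std::uint64_t get_varint(const std::uint8_t *&p) {
    std::uint64_t v = 0;
    int shift = 0;
    while (*p & 0x80) {
        v |= static_cast<std::uint64_t>(*p++ & 0x7F) << shift;
        shift += 7;
    }
    return v | (static_cast<std::uint64_t>(*p++) << shift);
}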
We plan to improve the tool in the coming months and to keep our GitHub version up to date, so stay tuned.
Is there a simple way to use it as a library?
There's nothing inherently wrong with STL containers; it's just that they are general-purpose and tuned for convenience, and in most cases you can probably write a faster compressor by using data structures tuned for your purpose. For efficient associative containers, for instance, the STL only gives you red-black trees and, only recently, hash tables.
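To make that concrete, here is roughly the kind of purpose-built structure a fast LZ compressor uses in place of a generic associative container: a hash-chain match finder built from two flat arrays. This is only an illustrative sketch (the hash function, table size and 3-byte minimum match are arbitrary choices), not code from bc-zip or any particular compressor:

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal hash-chain match finder: two flat arrays instead of a generic
// associative container. Parameters are arbitrary; illustrative only.
struct HashChain {
    static const int HASH_BITS = 15;
    std::vector<std::int32_t> head;   // head[h] = most recent position with hash h, or -1
    std::vector<std::int32_t> prev;   // prev[i] = previous position with the same hash as i

    explicit HashChain(std::size_t n)
        : head(std::size_t(1) << HASH_BITS, -1), prev(n, -1) {}

    static std::uint32_t hash3(const std::uint8_t *p) {   // hash of the next 3 bytes
        return (p[0] * 506832829u ^ p[1] * 2654435761u ^ p[2]) >> (32 - HASH_BITS);
    }

    // Record position i and return the most recent earlier position with the
    // same hash; the caller extends the match from that candidate (and can
    // keep walking prev[] for further candidates).
    std::int32_t insert(const std::uint8_t *data, std::int32_t i) {
        std::uint32_t h = hash3(data + i);
        std::int32_t candidate = head[h];
        prev[i] = candidate;
        head[h] = i;
        return candidate;
    }
};

Everything lives in contiguous arrays indexed by position, so lookups are one hash plus a couple of array reads, with no node allocations or pointer chasing.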
C++ isn't bad, but it has too many features that all look fun to try out, and that's a distraction. I prefer to use C.
Is that the algorithm described in the paper "Bit-optimal Lempel-Ziv compression" (http://arxiv.org/pdf/0802.0835.pdf)?
Partially. The overall structure is the same, but the FSG algorithm (that is, the algorithm described in Section 5.1) has been replaced with a simpler yet still optimal algorithm based on (hierarchical) suffix-array scanning, described in our upcoming ESA paper (preprint).
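For readers who haven't seen the construction: the bit-optimal parsing itself is a shortest-path computation on a DAG with one node per text position and one edge per admissible phrase, weighted by its encoded size in bits. Below is a toy sketch of just that relaxation step; it assumes the candidate edges and their bit costs are already available, which is precisely the part that FSG (or the suffix-array scanning mentioned above) computes efficiently, and it is not the actual bc-zip code:

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// One candidate phrase starting at a position: a literal (len == 1) or a copy,
// with 'bits' the cost of encoding it under the chosen integer encoder.
struct Edge { std::uint32_t len; std::uint64_t bits; };

// Toy forward relaxation over the parse DAG: node i = "first i bytes parsed",
// best[i] = fewest bits needed to reach it. edges[i] lists the phrases that
// may start at position i (hypothetical input, assumed precomputed).
std::vector<std::uint64_t>
optimal_parse_cost(const std::vector<std::vector<Edge>> &edges) {
    const std::size_t n = edges.size();
    const std::uint64_t INF = std::numeric_limits<std::uint64_t>::max();
    std::vector<std::uint64_t> best(n + 1, INF);
    best[0] = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (best[i] == INF) continue;                 // position not reachable
        for (const Edge &e : edges[i]) {
            if (i + e.len > n) continue;              // ignore edges past the end
            if (best[i] + e.bits < best[i + e.len])
                best[i + e.len] = best[i] + e.bits;   // relax the edge
        }
    }
    return best;                                      // best[n] = optimal parse size in bits
}

Recovering the actual parse only requires remembering, for each reached node, which edge last improved it.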
I've been reading the Farruggia et al. papers and I thought I would comment here. My post is not intended as a criticism of the work, which has some very interesting aspects, but rather to place it in context given its questionable claims about "dominating" "highly engineered" competitors, such as LZMA and LZ4; and also simply to bring up some problems that arise when designing practical LZ77-based formats.
The "bicriteria" approach trades compressed size with decompression time in a controlled manner. It does *not* perform controlled tradeoffs regarding compression time, compressor memory usage, and decompressor memory usage. Indeed, the approach gains an unfair advantage over other LZ77-based compression algorithms by using a very large or even infinite sliding window size, which ignores the memory usage concerns that real compressors deal with. And to that end, all the authors' benchmarks were for huge files. If I recall correctly, the LZMA compressor defaults to a 64 MiB sliding window to limit memory usage in both the compressor and decompressor, even though it actually supports a 1 GiB sliding window. The DEFLATE format is even more restricted in this regard, being limited to just a 32 KiB sliding window (probably too small for 2014, but at least DEFLATE remains usable on severely memory-limited embedded systems).
I like the idea of having the compressor estimate decompression time and using this as a factor in parsing decisions. However, the usefulness of any such scheme is necessarily limited by how well decompression time can be estimated. It isn't clear how well the authors' proposed approximation matches up with real computers. A few things that are important in practice that don't (yet) appear to be addressed by the proposed approximation are: a cache line loaded into the cache may satisfy multiple matches until it is evicted; copying speed is often determined by the number of whole words that need to be copied, not bytes; on many modern architectures there is a large penalty for unaligned memory accesses; and most importantly of all, the raw data copying is just one part of the overall decompression time. In fact, other factors such as how well the CPU predicts branches in the code can make a huge difference in decompression time.
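To illustrate how rough any such estimate necessarily is, here is a toy per-copy cost model in the general spirit of what the paper proposes: charge each copy a latency that depends on which level of the memory hierarchy its distance is expected to hit, plus a per-byte term. Every threshold and constant below is an invented placeholder, and the model ignores exactly the effects listed above:

#include <cstdint>

// Toy estimate of the cost (in arbitrary "time units") of decoding one copy of
// 'len' bytes from 'dist' bytes back. All constants are invented placeholders.
double estimate_copy_cost(std::uint32_t dist, std::uint32_t len) {
    double latency;
    if      (dist <= (32u << 10))  latency = 4.0;    // source assumed still in L1
    else if (dist <= (256u << 10)) latency = 12.0;   // L2
    else if (dist <= (8u << 20))   latency = 40.0;   // L3
    else                           latency = 100.0;  // main memory
    return latency + 0.25 * len;                     // crude linear per-byte term
}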
There is also the question of which integer encoder to use. This has a large effect on the decompression speed, but the bicriteria algorithm itself requires that the integer encoder be specified manually by the user. The algorithm then optimizes only in the context of that encoder. A more complete solution to "bicriteria" compression would need to automatically select an appropriate integer encoder.
The authors assume the integer encoder satisfies the "non-decreasing cost property", meaning that a larger number is never cheaper to encode than a smaller one. This property is used to reduce the number of matches that need to be found and considered during optimal parsing. This stands in contrast to LZMA's optimal parser, for example, which might find and consider matches at a position, then find and consider those very same matches at the next position, but all one byte shorter. But it turns out that LZMA does things that way for a reason: its encodings of match lengths and distances do not satisfy the non-decreasing cost property. In fact, in any LZ77-based format where entropy encoding or adaptive state is being used, a larger length or larger distance may be less costly to encode than a smaller length or a smaller distance. So the algorithm proposed by Farruggia et al. is, in fact, more specialized than the authors suggest; it cannot even be combined with Huffman encoding.
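To make the property concrete: a fixed, byte-aligned encoder has a cost that depends only on the magnitude of the value, so it is automatically non-decreasing, whereas with Huffman or adaptive coding the cost of a distance comes from whatever code length its bucket happened to receive, and nothing keeps that monotone in the value. Both cost functions below are invented purely for illustration:

#include <cstdint>

// A fixed, byte-aligned encoder: cost depends only on the magnitude of the
// value, so it is non-decreasing. (Invented code lengths, for illustration.)
int fixed_cost_bits(std::uint32_t v) {
    if (v < (1u << 8))  return 16;   // e.g. tag byte + 1 payload byte
    if (v < (1u << 16)) return 24;   // tag byte + 2 payload bytes
    return 40;                       // tag byte + 4 payload bytes
}

// A Huffman-style encoder: the cost of a value is the code length assigned to
// its log2 bucket by the statistics, and nothing forces that to be monotone in
// v, so a larger distance can genuinely be cheaper than a smaller one.
int huffman_like_cost_bits(std::uint32_t v, const std::uint8_t code_len[32]) {
    int bucket = 0;
    while ((v >> bucket) > 1) ++bucket;   // bucket = floor(log2(v)) for v >= 1
    return code_len[bucket];
}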
So to summarize, I see the main contributions of the work as (1) considering decompression time as a factor in the parsing algorithm, even if the proposed method is very approximate; and (2) potentially faster optimal parsing for LZ77-based formats that do not use any entropy encoding or adaptive state. An algorithm that can be tuned to be fully competitive with both LZMA and LZ4 does *not* appear to be a contribution of the work at this point in time.