
Thread: "bi-criteria" data compression

  1. #1
    Member
    Join Date
    Aug 2013
    Location
    United States
    Posts
    4
    Thanks
    4
    Thanked 4 Times in 2 Posts

    "bi-criteria" data compression

    I've not seen any references about this new compressor on this forum, so thought I'd post some links about it here to generate discussion:

    http://acube.di.unipi.it/bc-zip/

    http://zola.di.unipi.it/rossano/wp-c...pdf/soda14.pdf

  2. Thanks:

    Bulat Ziganshin (26th March 2014)

  3. #2
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    That sounds impressive:

    The performance of our BC-ZIP is extremely good even when compared with leading high-performance data compressors. On one hand, it generates parsings with decompression times that are on par with that of LZ4 (slightly slower on Wikipedia,
    faster on Census) but with significantly improved compression ratios. On the other hand, its compression ratios at higher compression levels are close to those of the best one, namely LZMA2, but with an order of magnitude faster decompression speeds.


    That research group seems to be the best and one of the few doing basic theoretical work on LZ these days.

  4. #3
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    It's too bad they don't report compression time and don't publish code so we can test for ourselves.

    One of their tests is for 1 GB of Wikipedia text, but this is not enwik9 as far as I can tell. The sizes they give for other compressors like snappy, LZ4, LZMA, zlib, bzip2 are different than LTCB.

  5. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    It's too bad they don't report compression time and don't publish code so we can test for ourselves.

    One of their tests is for 1 GB of Wikipedia text, but this is not enwik9 as far as I can tell. The sizes they give for other compressors like snappy, LZ4, LZMA, zlib, bzip2 are different than LTCB.
    There is source code for a lot of algorithms that Giovanni Manzini co-authored on Manzini's web page. Manzini was not an author on this one. This group mentions that the code is in C++11. They've probably over-engineered the classes and templates and that's what's holding it up. C++ is an invitation to expend your efforts on everything besides creating working code.
    Last edited by nburns; 27th March 2014 at 03:10.

  6. #5
    Member
    Join Date
    Jul 2013
    Location
    Seattle, Washington, USA
    Posts
    2
    Thanks
    0
    Thanked 1 Time in 1 Post
    Quote Originally Posted by Matt Mahoney View Post
    It's too bad they don't report compression time and don't publish code so we can test for ourselves.
    The description from the paper (Weight-Constrained Shortest Path over the LZ77 DAG) makes me suspect this is around as slow as zopfli (<1MB/s).
    Quote Originally Posted by Matt Mahoney View Post
    One of their tests is for 1 GB of Wikipedia text, but this is not enwik9 as far as I can tell. The sizes they give for other compressors like snappy, LZ4, LZMA, zlib, bzip2 are different than LTCB.
    The paper claims the Wikipedia text is from http://download.wikimedia.org/enwiki...ticles.xml.bz2 , but that's symlinked to whatever the latest dump was, which isn't very useful.
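    For reference, the "Weight-Constrained Shortest Path over the LZ77 DAG" formulation treats every text position as a node and every admissible literal or match as an edge weighted by its encoded size; the plain "bit-optimal" case is then an ordinary shortest path over a DAG. Below is a minimal sketch of that unconstrained case only, assuming the caller supplies the candidate edges and their bit costs; it is not the authors' weight-constrained algorithm.

    Code:
    #include <cstdint>
    #include <limits>
    #include <vector>

    // Candidate phrase starting at some position: it ends at 'to' and costs 'bits' to encode.
    struct Edge { size_t to; uint64_t bits; };

    // Bit-optimal parsing as single-source shortest path over the LZ77 DAG.
    // Nodes are text positions 0..n (already in topological order, since edges only go forward);
    // edges[i] must list every literal/match starting at i, so node n is always reachable.
    std::vector<size_t> bit_optimal_parse(const std::vector<std::vector<Edge>>& edges, size_t n) {
        const uint64_t INF = std::numeric_limits<uint64_t>::max();
        std::vector<uint64_t> cost(n + 1, INF);
        std::vector<size_t>   pred(n + 1, 0);
        cost[0] = 0;
        for (size_t i = 0; i < n; ++i) {
            if (cost[i] == INF) continue;
            for (const Edge& e : edges[i]) {
                if (cost[i] + e.bits < cost[e.to]) {   // relax: cheaper way to reach e.to
                    cost[e.to] = cost[i] + e.bits;
                    pred[e.to] = i;
                }
            }
        }
        std::vector<size_t> cut;                       // phrase end positions, recovered backwards
        for (size_t p = n; p != 0; p = pred[p]) cut.push_back(p);
        return std::vector<size_t>(cut.rbegin(), cut.rend());
    }

    The bicriteria version attaches a second weight (an estimate of decompression time) to each edge and asks for the cheapest path whose total time stays under a user-given bound, which is what makes the problem harder than the sketch above.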

  7. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by rmmh View Post
    The description from the paper (Weight-Constrained Shortest Path over the LZ77 DAG) makes me suspect this is around as slow as zopfli (<1MB/s).
    They didn't claim the encoder was empirically fast in the conclusion, so it probably wasn't.

  8. #7
    Member
    Join Date
    Aug 2013
    Location
    United States
    Posts
    4
    Thanks
    4
    Thanked 4 Times in 2 Posts
    I just noticed that they've updated the above link with some more info, charts, and a link to the source code on github:

    https://github.com/farruggia/bc-zip

  9. Thanks (3):

    Cyan (23rd June 2014),Matt Mahoney (23rd June 2014),m^2 (24th June 2014)

  10. #8
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,257
    Thanks
    307
    Thanked 795 Times in 488 Posts
    I am a bit disappointed. Decompression is very fast and compression ratio is good, but the website fails to mention that compression is slow and requires huge memory. I was able to compress enwik8 in Ubuntu but enwik9 throws std::bad_alloc (out of memory):

    bc-zip bit-optimal enwik8 enwik8.lzo -e soda09_16 -b 16 --> 29,900,824, 73s, 0.4s, 2715 MB memory

    The bit-optimal command means use best compression. There is a compress command that lets you specify bounds on either size or decompression speed, but it requires a target file specifying machine characteristics. It has a tool to generate the file, but that tool needs another file generated by LMBench3. I got an error when I tried to install it, so I gave up.

    The -e option specifies an encoder. I tried both of the recommended ones. The soda09_16 encoder gets better compression so I used it for most tests. The hybrid-16 encoder decompresses faster.

    The -b option specifies a bucket size in MB. I am testing on a machine with 4 GB memory. I found that options from -b 20 to -b 50 allocate 4.6 GB memory, and using the default requires 5.6 GB. It will still compress but is considerably slower with a lot of disk thrashing. enwik9 ran out of memory even with -b 1. Here are some results (size, real compression time, real decompression time, compression peak virtual memory):

    default --> 27,528,734, 597s, 0.4s, 5.6 GB
    -b 50 --> 28,387,413, 264s, 0.4s, 4.2 GB
    -b 32 --> 28,911,321, 187s, 0.4s, 4.6 GB
    -b 20 --> 29,619,793, 182s, 0.4s, 4.6 GB
    -b 16 --> 29,900,824, 73s, 0.4s, 2.7 GB
    -b 16 -e hybrid-16 --> 34,845,060, 72s, 0.26s, 3.0 GB

    Memory usage is peak virtual memory, observed by watching the output of top. It tends to fluctuate around 3-4 GB most of the time. The program uses 1 thread.

    The test machine is a 2.67 GHz M620, 2+2 hyperthreads, 4 GB, 64 bit Ubuntu, compiled with g++ 4.8.2 and boost_1_55_0.

  11. #9
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by Matt Mahoney View Post
    I am a bit disappointed. Decompression is very fast and compression ratio is good, but the website fails to mention that compression is slow and requires huge memory. I was able to compress enwik8 in Ubuntu but enwik9 throws std::bad_alloc (out of memory):
    They all but admitted it by not claiming otherwise in the paper. See comment #6 above.

    If they used STL data structures, rather than writing their data structures by hand, then they probably left a lot of performance on the table.

  12. #10
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    140
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by nburns
    This group mentions that the code is in C++11. They've probably over-engineered the classes and templates and that's what's holding it up. C++ is an invitation to expend your efforts on everything besides creating working code.
    Quote Originally Posted by nburns View Post
    If they used STL data structures, rather than writing their data structures by hand, then they probably left a lot of performance on the table.
    These two statements seem incompatible. The STL is intended exactly to prevent "expending your efforts on everything besides creating working code", whereas "writing their data structures by hand" will do exactly that. It seems to me that they used C++ for exactly the right reasons. They were clearly focused on demonstrating good compression with high decoding speeds. If they chose the STL to invest only the required amount of energy in getting the encoding side solid enough to remain focused on their goal (decoding performance), then they used the STL as it was intended to be used.

    I gave a cursory look at the code and I didn't see any over-engineered classes and templates. They did use a lot of unwarranted typedefs and too much inheritance rather than composition (in my opinion), but that's a design choice and not an example of 'over-engineering'.

    C++ is often the best tool available when placed in the hands of an experienced engineer. In the hands of a hobbyist or academic, perhaps not so much.

    Expending your time writing basic data structures and algorithms by hand rather than using the STL (unless it's really required) is the best way to ensure that you will spend far more energy later on debugging, dealing with maintenance issues, and condemning future developers to wade through it each time they need to do any work in that neighborhood. That is the best way to make sure that your efforts go to everything besides creating working code.

  13. #11
    Member
    Join Date
    Jun 2014
    Location
    Pisa, Italy
    Posts
    3
    Thanks
    0
    Thanked 4 Times in 2 Posts
    Hello,
    I'm Andrea Farruggia, one of the authors of the bc-zip paper(s) and the main developer of the tool.

    A few words about compression performance.
    Although in our latest ESA paper (preprint available on the bc-zip website) we made important steps forward in lowering compression time (a new "cache-friendly" FSG algorithm), due to lack of time we didn't put much emphasis on memory consumption, which is admittedly high (~50 GiB for compressing a 1 GiB file without bucketing, as mentioned in our ESA paper).

    There is an ongoing effort to improve this situation, as we are evaluating different approaches to close the compression-time gap with gzip while retaining our decompression-time and compression-ratio performance. For instance, one easy strategy to lower the working memory in compression would be to employ a variable-length integer representation in some of the compressor's data structures.
    In fact, in many places we use full 32/64-bit integers even when we know that those integers are going to be much smaller. Representing them in a more succinct way would save a lot of memory.
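    For readers unfamiliar with the idea, a byte-oriented variable-length representation stores small values in fewer bytes. The sketch below is a generic LEB128-style encoder/decoder, not bc-zip's actual in-memory layout:

    Code:
    #include <cstdint>
    #include <vector>

    // LEB128-style varint: 7 payload bits per byte, high bit set on every byte except the last.
    // Values below 128 take one byte instead of the 4 or 8 a full-width integer would occupy.
    void put_varint(std::vector<uint8_t>& out, uint64_t v) {
        while (v >= 0x80) {
            out.push_back(uint8_t(v) | 0x80);
            v >>= 7;
        }
        out.push_back(uint8_t(v));
    }

    uint64_t get_varint(const uint8_t*& p) {
        uint64_t v = 0;
        int shift = 0;
        while (*p & 0x80) {
            v |= uint64_t(*p++ & 0x7F) << shift;
            shift += 7;
        }
        return v | (uint64_t(*p++) << shift);
    }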

    We plan to improve the tool in the coming months and to keep our GitHub version up to date, so stay tuned.

  14. Thanks (2):

    Matt Mahoney (24th June 2014),nburns (24th June 2014)

  15. #12
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Is there a simple way to use it as a library?

  16. #13
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by michael maniscalco View Post
    These two statements seem incompatible. The STL is intended exactly to prevent "expending your efforts on everything besides creating working code", whereas "writing their data structures by hand" will do exactly that. [...] That is the best way to make sure that your efforts go to everything besides creating working code.
    There's nothing inherently wrong with STL containers; it's just that they are general-purpose and tuned for convenience, and in most cases you can probably write a faster compressor by using data structures tuned for your purpose. For efficient associative containers, for instance, the STL only gives you red-black trees and, only recently, hash tables.
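    As a concrete example of the kind of purpose-built structure meant here, LZ match finders are often just two flat arrays forming hash chains rather than an STL associative container. A minimal generic sketch (the hash function and table size are arbitrary choices, not taken from bc-zip):

    Code:
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hash-chain match finder built from two flat arrays instead of an associative container.
    // head[h] holds the most recent position whose next 4 bytes hash to h; prev[i] links back
    // to the previous position with the same hash, forming a per-bucket chain.
    struct HashChain {
        static constexpr uint32_t kLogBuckets = 16;
        std::vector<int32_t> head, prev;
        explicit HashChain(size_t n) : head(size_t(1) << kLogBuckets, -1), prev(n, -1) {}

        static uint32_t hash4(const uint8_t* p) {
            uint32_t x;
            std::memcpy(&x, p, 4);                    // unaligned-safe 4-byte load
            return (x * 2654435761u) >> (32 - kLogBuckets);
        }

        // Insert position i (caller guarantees i + 4 <= input size) and return the most
        // recent earlier position with the same hash; walk prev[] from there to verify
        // and extend candidate matches.
        int32_t insert(const uint8_t* data, size_t i) {
            uint32_t h = hash4(data + i);
            int32_t candidate = head[h];
            prev[i] = candidate;
            head[h] = int32_t(i);
            return candidate;
        }
    };

    Everything is preallocated and index-based, so there is none of the per-node allocation and pointer chasing that std::map or std::unordered_map would bring into the hot loop.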

    C++ isn't bad, but it has too many features that all look fun to try out, and that's a distraction. I prefer to use C.
    Last edited by nburns; 24th June 2014 at 22:42.

  17. #14
    Member
    Join Date
    Jun 2014
    Location
    Pisa, Italy
    Posts
    3
    Thanks
    0
    Thanked 4 Times in 2 Posts
    Quote Originally Posted by m^2 View Post
    Is there a simple way to use it as a library?
    Yes: use the API in include/api.hpp and link against the bcobjs and divsufsort libraries.
    At the moment the API lets you decompress a file and compress it bit-optimally; we will expand the API to allow compressing a file with the bicriteria strategy in a future release.
    Last edited by af_; 24th June 2014 at 22:21.

  18. #15
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 72 Times in 56 Posts
    Quote Originally Posted by af_ View Post
    Yes: use the API in include/api.hpp and link against the bcobjs and divsufsort libraries.
    At the moment the API lets you decompress a file and compress it bit-optimally; we will expand the API to allow compressing a file with the bicriteria strategy in a future release.
    Is that the algorithm described in the paper "Bit-optimal Lempel-Ziv compression" (http://arxiv.org/pdf/0802.0835.pdf)?

  19. #16
    Member
    Join Date
    Jun 2014
    Location
    Pisa, Italy
    Posts
    3
    Thanks
    0
    Thanked 4 Times in 2 Posts
    Quote Originally Posted by nburns View Post
    Is that the algorithm described in the paper "Bit-optimal Lempel-Ziv compression" (http://arxiv.org/pdf/0802.0835.pdf)?
    Partially. The overall structure is the same, but the FSG algorithm (that is, the algorithm described in Section 5.1) has been replaced by a simpler yet still optimal algorithm based on (hierarchical) suffix-array scanning, described in our upcoming ESA paper (preprint).
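    To convey only the flavour of suffix-array scanning (this is a deliberately naive illustration, not the hierarchical FSG algorithm from the ESA paper), the longest previous match at a position can be found by looking at the nearest suffix-array neighbours that start earlier in the text, here using libdivsufsort as the tool itself does:

    Code:
    #include <divsufsort.h>   // libdivsufsort, the suffix-array library bc-zip links against
    #include <utility>
    #include <vector>

    // Character-by-character LCP of the suffixes starting at a and b.
    static int lcp_at(const unsigned char* t, int n, int a, int b) {
        int l = 0;
        while (a + l < n && b + l < n && t[a + l] == t[b + l]) ++l;
        return l;
    }

    // Longest previous match for position i: among suffixes starting before i, the best
    // candidates are the nearest earlier-starting neighbours of rank[i] in suffix order,
    // one on each side. Correct but slow in the worst case; real implementations bound
    // or precompute these scans.
    // Setup (once): std::vector<int> sa(n), rank(n);
    //               divsufsort(t, sa.data(), n);
    //               for (int r = 0; r < n; ++r) rank[sa[r]] = r;
    std::pair<int, int> longest_previous_match(const unsigned char* t, int n,
                                                const std::vector<int>& sa,
                                                const std::vector<int>& rank, int i) {
        int best_src = -1, best_len = 0;
        for (int r = rank[i] - 1; r >= 0; --r) {      // nearest lexicographically smaller suffix
            if (sa[r] < i) {                          // ...that starts before i
                int l = lcp_at(t, n, sa[r], i);
                if (l > best_len) { best_len = l; best_src = sa[r]; }
                break;
            }
        }
        for (int r = rank[i] + 1; r < n; ++r) {       // nearest lexicographically larger suffix
            if (sa[r] < i) {
                int l = lcp_at(t, n, sa[r], i);
                if (l > best_len) { best_len = l; best_src = sa[r]; }
                break;
            }
        }
        return {best_src, best_len};                  // (source position, match length)
    }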

  20. Thanks (2):

    Bulat Ziganshin (24th June 2014),nburns (24th June 2014)

  21. #17
    Member
    Join Date
    Nov 2014
    Location
    Earth
    Posts
    38
    Thanks
    0
    Thanked 77 Times in 19 Posts
    I've been reading the Farruggia et al. papers and I thought I would comment here. My post is not intended as a criticism of the work, which has some very interesting aspects, but rather to place it in context given its questionable claims about "dominating" "highly engineered" competitors, such as LZMA and LZ4; and also simply to bring up some problems that arise when designing practical LZ77-based formats.

    The "bicriteria" approach trades compressed size with decompression time in a controlled manner. It does *not* perform controlled tradeoffs regarding compression time, compressor memory usage, and decompressor memory usage. Indeed, the approach gains an unfair advantage over other LZ77-based compression algorithms by using a very large or even infinite sliding window size, which ignores the memory usage concerns that real compressors deal with. And to that end, all the authors' benchmarks were for huge files. If I recall correctly, the LZMA compressor defaults to a 64 MiB sliding window to limit memory usage in both the compressor and decompressor, even though it actually supports a 1 GiB sliding window. The DEFLATE format is even more restricted in this regard, being limited to just a 32 KiB sliding window (probably too small for 2014, but at least DEFLATE remains usable on severely memory-limited embedded systems).


    I like the idea of having the compressor estimate decompression time and using this as a factor in parsing decisions. However, the usefulness of any such scheme is necessarily limited by how well decompression time can be estimated. It isn't clear how well the authors' proposed approximation matches up with real computers. A few things that are important in practice that don't (yet) appear to be addressed by the proposed approximation are: a cache line loaded into the cache may satisfy multiple matches until it is evicted; copying speed is often determined by the number of whole words that need to be copied, not bytes; on many modern architectures there is a large penalty for unaligned memory accesses; and most importantly of all, the raw data copying is just one part of the overall decompression time. In fact, other factors such as how well the CPU predicts branches in the code can make a huge difference in decompression time.
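    To make that concrete, the kind of per-phrase decode-cost model under discussion might look like the sketch below; every constant is a made-up placeholder, and the omissions (alignment, shared cache lines, branch prediction) are exactly the ones listed above:

    Code:
    #include <cstdint>

    // Hypothetical per-phrase decode-time estimate in arbitrary time units. It models a
    // fixed per-phrase overhead, word-granular copying, and a distance-dependent cache
    // penalty, and deliberately ignores everything else -- which is why such models are
    // only rough approximations of real decompression time.
    struct CostModel {
        double fixed        = 2.0;   // decode/dispatch overhead per phrase (placeholder)
        double per_word     = 0.3;   // cost per 8-byte word copied (placeholder)
        double l3_penalty   = 12.0;  // source likely outside L2 (placeholder threshold below)
        double dram_penalty = 60.0;  // source likely outside the last-level cache (placeholder)

        double phrase_cost(uint32_t length, uint32_t distance) const {
            double c = fixed + per_word * ((length + 7) / 8);
            if (distance > (8u << 20))        c += dram_penalty;  // farther than ~8 MiB back
            else if (distance > (256u << 10)) c += l3_penalty;    // farther than ~256 KiB back
            return c;
        }
    };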


    There is also the question of which integer encoder to use. This has a large effect on the decompression speed, but the bicriteria algorithm itself requires that the integer encoder be specified manually by the user. The algorithm then optimizes only in the context of that encoder. A more complete solution to "bicriteria" compression would need to automatically select an appropriate integer encoder.


    The authors assume the integer encoder satisfies the "non-decreasing cost property", meaning that a larger number is never less costly to encode than a smaller one. This property is used to reduce the number of matches that need to be found and considered during optimal parsing. This stands in contrast to LZMA's optimal parser, for example, which might find and consider matches at a position, then find and consider those very same matches at the next position, but all one byte shorter. But it turns out that LZMA does things that way for a reason: its encodings of match lengths and distances do not satisfy the non-decreasing cost property. In fact, in any LZ77-based format where entropy encoding or adaptive state is being used, a larger length or larger distance may be less costly to encode than a smaller length or smaller distance. So the algorithm proposed by Farruggia et al. is, in fact, more specialized than the authors suggest; it cannot even be combined with Huffman encoding.
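    The property itself is easy to state in code. Universal codes such as Elias gamma satisfy it by construction, whereas a Huffman table built from real length/distance statistics generally does not; a small illustrative check (toy code, not from the paper):

    Code:
    #include <cstdint>
    #include <vector>

    // "Non-decreasing cost": encoding a larger value never takes fewer bits than a smaller one.
    bool is_non_decreasing(const std::vector<uint32_t>& bits_for_value) {
        for (size_t v = 1; v < bits_for_value.size(); ++v)
            if (bits_for_value[v] < bits_for_value[v - 1]) return false;
        return true;
    }

    // Elias gamma needs 2*floor(log2 v) + 1 bits for v >= 1, so its cost table passes the
    // check above. A Huffman table for match lengths can assign a frequent large length a
    // shorter codeword than a rare small one, so its cost table can fail the check -- and
    // with it the assumption behind the faster parsing.
    uint32_t elias_gamma_bits(uint32_t v) {
        uint32_t lg = 0;
        while ((v >> lg) > 1) ++lg;
        return 2 * lg + 1;
    }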


    So to summarize, I see the main contributions of the work as (1) considering decompression time as a factor in the parsing algorithm, even if the proposed method is very approximate; and (2) potentially faster optimal parsing for LZ77-based formats that do not use any entropy encoding or adaptive state. An algorithm that can be tuned to be fully competitive with both LZMA and LZ4 does *not* appear to be a contribution of the work at this point in time.

  22. #18
    Member
    Join Date
    Aug 2014
    Location
    Overland Park, KS
    Posts
    17
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Zyzzyva View Post
    The "bicriteria" approach trades compressed size with decompression time in a controlled manner. It does *not* perform controlled tradeoffs regarding compression time, compressor memory usage, and decompressor memory usage. [...] An algorithm that can be tuned to be fully competitive with both LZMA and LZ4 does *not* appear to be a contribution of the work at this point in time.
    Yeah, I think they should use less memory.
