Page 1 of 8
Results 1 to 30 of 239

Thread: Ultra-fast LZ

  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts

    Ultra-fast LZ

    Let me introduce my ultra-fast LZ77-based file compressor called ULZ. It was designed to provide the fastest decompression possible while maintaining a reasonably good compression ratio, unlike many other fast LZ77 compressors. ULZ has six compression levels to choose from:
    1. Fastest
    2. Fast
    3. Normal (Default)
    4. High
    5. Maximum
    6. Extreme
    As expected, the compression level doesn't really affect decompression time or memory usage. The algorithm was invented a few years ago, while I worked on a Nintendo DS compression suite; as a result, ULZ runs pretty fast even on stone-age hardware, and the decompressor needs no extra memory. Anyway, this one is a PC version with increased memory usage and a larger LZ77 window, to provide some extra compression. In the end, we've got compression that is on par with Deflate, but with much faster decompression, since ULZ uses only byte I/O - no Huffman coding, no arithmetic coding, not even bit I/O. I'm still wondering how, with such a simple compression format and such a simple decompressor (a few lines of code), it achieves any compression at all... So anyway, enjoy!
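    The post doesn't give ULZ's actual token layout, so purely as an illustration, here is a minimal byte-oriented LZ77 decoder in the same spirit - plain byte I/O, no Huffman, no bit-level reads. The control-byte format (literal runs below 0x80, match codes at or above it) is entirely hypothetical:

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical byte-only token format (not ULZ's real one):
       control byte c < 0x80  -> copy the next c+1 literal bytes
       control byte c >= 0x80 -> match of length (c & 0x7F) + 3, followed
                                 by a 2-byte little-endian back offset */
    static size_t lz_decode(const unsigned char *in, size_t in_len,
                            unsigned char *out)
    {
        size_t ip = 0, op = 0;
        while (ip < in_len) {
            unsigned char c = in[ip++];
            if (c < 0x80) {                       /* literal run */
                size_t n = (size_t)c + 1;
                memcpy(out + op, in + ip, n);
                ip += n;
                op += n;
            } else {                              /* back-reference */
                size_t len = (size_t)(c & 0x7F) + 3;
                size_t off = in[ip] | ((size_t)in[ip + 1] << 8);
                ip += 2;
                while (len--) {                   /* overlap-safe byte copy */
                    out[op] = out[op - off];
                    op++;
                }
            }
        }
        return op;
    }

    int main(void)
    {
        /* "abc" as literals, a length-6 match at offset 3, then "d" */
        const unsigned char packed[] = {
            0x02, 'a', 'b', 'c', 0x83, 0x03, 0x00, 0x00, 'd'
        };
        unsigned char out[32];
        size_t n = lz_decode(packed, sizeof packed, out);
        assert(n == 10 && memcmp(out, "abcabcabcd", 10) == 0);
        return 0;
    }
    ```

    The whole decoder really is just a loop with a literal branch and a match branch, which matches the "few lines of code" claim above.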

    UPDATE: (ULZ v0.02)
    • Added a new extra fast mode (c1)
    • Some code changes

    UPDATE 2017-07-13: The latest version is ULZ v0.06!
    Attached Files

  2. #2
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts


    Thanks Ilia!

  3. #3
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts


    Quick test (c1) using ENWIK8 and ENWIK9 on my old Pentium 3, 750 MHz machine...


    ULZ v0.01 by Ilia Muraviev
    Compressing...
    100000000 -> 45751335 in 26.4 sec

    ULZ v0.01 by Ilia Muraviev
    Decompressing...
    45751335 -> 100000000 in 7.3 sec


    ULZ v0.01 by Ilia Muraviev
    Compressing...
    1000000000 -> 411826108 in 263.1 sec


    ULZ v0.01 by Ilia Muraviev
    Decompressing...
    411826108 -> 1000000000 in 88.7 sec

  4. #4
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts


    Thanks a lot for testing!
    It would be cool if you could test other fast LZ coders on such an old machine as well, and compare the results to ULZ. Also, don't forget about ULZ's other compression levels!

  5. #5
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by http://cs.fit.edu/~mmahoney/compression/text.html
    An underlined time means that no better compressor is faster.
    In mode 5 it achieves that among the LZ77 compressors; in mode 1 it would achieve that among all compressors.
    Good job!

  6. #6
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    http://mattmahoney.net/dc/text.html#3326

    New Pareto frontier in decompression time.

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    http://nishi.dreamhosters.com/u/ulz001.htm (ramdrive)

    Well, the metrics are certainly better than Lasse's, so it's something.
    But decompression time is not quite there (probably due to the I/O
    implementation), and note the expansion of incompressible data.

  8. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Quote Originally Posted by Matt Mahoney View Post
    http://mattmahoney.net/dc/text.html#3326

    New Pareto frontier in decompression time.
    Thanks a lot! I thought it would be faster - on my machine ULZ decompresses ENWIK9 in 3 sec...

  9. #9
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I must have a slow computer. But decompression is dominated by disk I/O. I measure process time. Wall time is about 5 times longer.
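    The process-time vs. wall-time gap described here is easy to reproduce. A small C sketch (POSIX `sleep` assumed; the 2-second sleep stands in for time spent blocked on disk I/O, which wall time counts and process CPU time does not):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>  /* POSIX, for sleep() */

    int main(void)
    {
        clock_t c0 = clock();          /* process CPU time */
        time_t  w0 = time(NULL);       /* wall clock */
        sleep(2);                      /* blocked: consumes no CPU */
        double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;
        double wall = difftime(time(NULL), w0);
        printf("cpu=%.2fs wall=%.0fs\n", cpu, wall);
        assert(wall >= 1.0);   /* wall time includes the blocked wait */
        assert(cpu < 1.0);     /* process time barely moves */
        return 0;
    }
    ```

    With a decompressor that is dominated by disk I/O, the same effect makes the measured "process time" several times smaller than the elapsed time a user actually experiences.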

  10. #10
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    ULZ v0.02 has been released. Check it out above!

  11. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    788
    Thanks
    64
    Thanked 274 Times in 192 Posts
    Tested ULZ v0.02 compression/decompression with enwik8 (level, compressed size, comp. time, decomp. time):

    ulz c1 50,382,083 bytes 1.034 sec. 3.247 sec.
    ulz c2 45,751,335 bytes 1.525 sec. 2.955 sec.
    ulz c3 41,677,764 bytes 1.956 sec. 1.961 sec.
    ulz c4 39,368,127 bytes 3.104 sec. 1.438 sec.
    ulz c5 37,861,566 bytes 7.149 sec. 1.415 sec.
    ulz c6 37,652,826 bytes 13.335 sec. 1.415 sec.

    Compare:

    http://www.metacompressor.com/upload...estfile=enwik8
    http://www.metacompressor.com/top.aspx?testfile=enwik8

    Only QuickLZ used less decompression CPU time.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Thanks for testing!

  14. #14
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Testing it on a Core2Duo E8300 (@3 GHz), Win7 x32.
    The testing tool is Timer v8 from Igor Pavlov, aggregating user and kernel time and excluding I/O time.
    Kernel time takes a significant share at such high speeds, even more so for ulz (where it is typically 1/3 of the total).


    enwik8 decompression time :
    ulz v0.02 c1 : 0.55s
    ulz v0.02 c6 : 0.50s

    To provide some comparison, here are some apparently faster competitors :
    LZTurbo -13 : 0.49s
    FastLZ : 0.49s
    LZP2 : 0.49s
    LZTurbo -12 : 0.45s
    QuickLZ -L3 : 0.45s
    LZOP -2 : 0.45s
    LZOP -1 : 0.42s
    LZTurbo -10 : 0.41s
    LZ4 : 0.38s

    and very close ones :
    LZF : 0.52s
    Tornado -1 : 0.59s
    Zhuff : 0.59s
    LZRW3a : 0.60s


    Rankings change when using different files.
    win98.vmdk is a virtual HDD image, with a highly varying compression ratio along the file.

    win98.vmdk decompression time :
    ulz v0.02 c1 : 1.52s
    ulz v0.02 c6 : 1.49s

    Competitors are :
    LZTurbo -12 : 1.38s
    LZTurbo -10 : 1.33s
    LZRW3a : 1.31s
    Zhuff : 1.27s
    QuickLZ -L3 : 1.10s
    FastLZ : 1.10s
    LZOP -2 : 0.96s
    LZ4 : 0.92s

    and close ones :
    LZP2 : 1.52s
    QuickLZ -L1 : 1.53s
    LZTurbo -20 : 1.59s
    QuickLZ -L2 : 1.75s

    See how some compressors become slower or faster relative to ULZ compared with the enwik8 ranking.
    An honorable mention for LZRW3a, which is still among the fastest in spite of its (relatively) old age...
    Last edited by Cyan; 3rd February 2010 at 13:33.

  15. #15
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    Don't forget about the compression ratio! Or else the COPY command will be the winner!

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    I guess it's a good time to repeat that it's plain wrong to compare
    different compressors by any of these virtual "times".
    There's a place for such timings too - e.g. when estimating
    the effect of some improvements on a program.
    But benchmarks are supposed to show the behavior of the programs
    in "natural" circumstances - not some estimations that nobody
    would ever be able to observe in real applications.
    And yeah, I know that it's hard to get stable measurements
    with a multitasking OS and a real HDD.
    But that doesn't mean that "process time" measurements
    taken with lots of apps open and an AV running in the background
    have any worth. Wall times still do, btw - e.g. the minimum of 10
    runs (not the average!) would still be a good estimate of the
    program's working time in such an environment.
    However, using a clean OS with no background activity and
    a ramdrive for storage is imho a better choice.

    As an example, consider two implementations of the same
    compression algo - one with sophisticated async I/O,
    and another which reads all the input into memory, compresses
    it in memory, and then stores all the output.
    Surely, the second approach would look noticeably faster under an
    "excluding I/O time" measurement - as it won't have any
    buffer-bounds-check overhead.
    But does that really mean there's no point in spending
    time implementing that async I/O?
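    The minimum-of-10-runs idea above can be sketched as follows; the busy loop is a stand-in workload (not a real compressor run), and POSIX `CLOCK_MONOTONIC` is the assumed wall-clock source:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <time.h>

    /* dummy workload standing in for one (de)compression run */
    static volatile unsigned long sink;
    static void workload(void)
    {
        unsigned long s = 0;
        for (unsigned long i = 0; i < 10000000UL; i++) s += i;
        sink = s;
    }

    int main(void)
    {
        enum { RUNS = 10 };
        double best = 1e9;
        for (int r = 0; r < RUNS; r++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);   /* wall clock */
            workload();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dt = (t1.tv_sec - t0.tv_sec)
                      + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            if (dt < best) best = dt;  /* keep the minimum, not the mean */
        }
        printf("best of %d runs: %.4f s\n", RUNS, best);
        assert(best > 0.0);
        return 0;
    }
    ```

    Taking the minimum rather than the average works because background activity can only add time to a run, never remove it, so the fastest observed run is the closest to the program's undisturbed cost.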
    Last edited by Shelwien; 3rd February 2010 at 14:12.

  17. #17
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,982
    Thanks
    377
    Thanked 351 Times in 139 Posts
    As for async I/O - that's platform-dependent stuff. ULZ's source is portable, as it should be. You can compile it under any OS - even for the Nintendo DS...

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    1. Async I/O is not just a matter of linking pthreads.
    You have to rewrite the whole algorithm so that it stays efficient.

    2. There are quite a few levels of _portable_ optimizations for
    your getc/putc I/O. But nobody would ever notice that they are
    necessary by looking only at "process times".

  19. #19
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    I guess it's a good time to repeat that it's plain wrong to compare
    different compressors by any of these virtual "times".
    There's a place for such timings too - e.g. when estimating
    the effect of some improvements on a program.
    But benchmarks are supposed to show the behavior of the programs
    in "natural" circumstances - not some estimations that nobody
    would ever be able to observe in real applications.
    And yeah, I know that it's hard to get stable measurements
    with a multitasking OS and a real HDD.
    But that doesn't mean that "process time" measurements
    taken with lots of apps open and an AV running in the background
    have any worth. Wall times still do, btw - e.g. the minimum of 10
    runs (not the average!) would still be a good estimate of the
    program's working time in such an environment.
    However, using a clean OS with no background activity and
    a ramdrive for storage is imho a better choice.

    As an example, consider two implementations of the same
    compression algo - one with sophisticated async I/O,
    and another which reads all the input into memory, compresses
    it in memory, and then stores all the output.
    Surely, the second approach would look noticeably faster under an
    "excluding I/O time" measurement - as it won't have any
    buffer-bounds-check overhead.
    But does that really mean there's no point in spending
    time implementing that async I/O?
    Definitely agree that multiple measurements of wall time on a clean system are the only truly reliable way of benchmarking.

    But I don't agree about the RAM drive.
    Yes, for very fast algorithms disks are slow. But that's what is used in the real world, and file compression programs shouldn't be optimized for something with such great IOPS and access times.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Sure, but ramdrive stats are relevant too these days.
    It's not unbelievable that people who would use fast LZ
    to improve storage access speed would also use a ramdrive for
    temp storage.

    Also, I actually agree about that and was posting HDD benchmarks
    before - and got into arguments about how slow my HDD is, etc.

    So in fact, using a ramdrive in benchmarks is a big step toward
    LZ developers, because in reality it's still not unlikely to see
    I/O speeds like 5 MiB/s (a notebook HDD or a loaded machine),
    which makes fast LZ algos a complete anachronism.
    Last edited by Shelwien; 3rd February 2010 at 15:05.

  21. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Btw, while collecting the coders mentioned by Cyan, I encountered this:
    "LZO Professional incorporates some of our stunning new technologies we have
    silently developed over the last years, including the new Ternary Partioning
    Scheme and the revolutionary Data To Code Transformations (D2CT)."

    And for that "D2CT" I've got an idea like this:
    instead of parsing the LZ-compressed data and interpreting it to
    produce the uncompressed data, we could generate some
    intermediate x86 code from the compressed data (i.e. compile it),
    which would later generate the uncompressed data.
    Imho it seems interesting, if not very useful :)

    But as for the first "magic name", I have no idea :)

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    http://nishi.dreamhosters.com/u/ulz002a.htm

    Tested the other LZ coders listed by Cyan.
    Entries inferior to some other entry in all 3 values are shown in gray.

    fastLZ was compiled with IntelC.

    Update: only size and dec.time are taken into account.
    http://nishi.dreamhosters.com/u/ulz002b.htm

    Update2: a compressed size up to 1% larger, with both times faster, is considered better
    http://nishi.dreamhosters.com/u/ulz002c.htm
    Last edited by Shelwien; 3rd February 2010 at 19:17.

  23. #23
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    Btw, while collecting the coders mentioned by Cyan, I encountered this:
    "LZO Professional incorporates some of our stunning new technologies we have
    silently developed over the last years, including the new Ternary Partioning
    Scheme and the revolutionary Data To Code Transformations (D2CT)."

    And for that "D2CT" I've got an idea like this:
    instead of parsing the LZ-compressed data and interpreting it to
    produce the uncompressed data, we could generate some
    intermediate x86 code from the compressed data (i.e. compile it),
    which would later generate the uncompressed data.
    Imho it seems interesting, if not very useful

    But as for the first "magic name", I have no idea
    The closest viable option is to define a secure subset of x86 code (a hard thing) and verify it before running. And I guess the verification would be too slow for this to be beneficial.

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Seems you misunderstood.
    The idea is to compress the file with the usual LZ,
    but on decoding to compile a script for generating the uncompressed
    data, instead of parsing and uncompressing it at once.
    So we'd get some code like:
    Code:
    ...
    LEA esi, [edi-123]
    MOV ecx, 13
    REP MOVSB
    MOV AL,0x20
    STOSB
    MOV EAX,0x20202020
    STOSD
    LEA esi,[edi-1000]
    MOV ecx, 100
    REP MOVSB
    ...
    So of course it's not any less secure than usual...
    the question is whether it would be faster.
    (Likely it would be, but only for huge amounts of data
    and very high compression ratios.
    Also, it kinda makes sense with multithreading.)

  25. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    http://nishi.dreamhosters.com/u/ulz002d.htm
    - CSC 3.1 added at the author's request

  26. #26
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    Seems you misunderstood.
    The idea is to compress the file with the usual LZ,
    but on decoding to compile a script for generating the uncompressed
    data, instead of parsing and uncompressing it at once.
    So we'd get some code like:
    Code:
    ...
    LEA esi, [edi-123]
    MOV ecx, 13
    REP MOVSB
    MOV AL,0x20
    STOSB
    MOV EAX,0x20202020
    STOSD
    LEA esi,[edi-1000]
    MOV ecx, 100
    REP MOVSB
    ...
    So of course it's not any less secure than usual...
    the question is whether it would be faster.
    (Likely it would be, but only for huge amounts of data
    and very high compression ratios.
    Also, it kinda makes sense with multithreading.)
    I think I still don't get it.
    Is the code supposed to be stored in the archive?

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    No, the usual matches and literals would be stored in the archive;
    the x86 code is too redundant for that anyway.
    But with code generation, the decoding process can be split
    into two independent passes - code generation and data generation -
    which can be more efficient than plain data generation in
    some circumstances.

  28. #28
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    461
    Thanked 257 Times in 105 Posts
    Btw, while collecting the coders mentioned by Cyan, I encountered this:
    "LZO Professional incorporates some of our stunning new technologies we have
    silently developed over the last years, including the new Ternary Partioning
    Scheme and the revolutionary Data To Code Transformations (D2CT)."
    Sounds a bit like overselling...

    However using a clean OS with no background activity and
    ramdrive for storage imho is a better choice.
    I agree, but that's a setup which is not so easy to spare.
    Now, maybe if we ask Tom's Hardware...

    Tested other LZ coders listed
    That was fast!

    Don't forget about the compression ratio! Or else the COPY command will be the winner!
    Sure, you're right, but then should we also compare compression time, and so on (memory usage, maybe)?
    Only compressors specially tweaked for decoding speed, with compression time disregarded, are really comparable, and they are not that numerous.
    LZOP -9, LZTurbo -19, your own LZSS, or good old Cabarc are the first that come to mind.

    file compression programs shouldn't be optimized for something with so great IOPS
    Well, that sounds a bit reductive.
    The compression field is not just about competing in the same arena as WinZip or 7-Zip. Most of these compressors are tools to learn and experiment with, enabling progress that can later be used in unforeseen applications.

    As a side note, I currently have an assignment in an embedded application, directly attributable to my (very) recent public work in the compression field. The first requirement is low CPU usage, the second is moderate memory usage, and only the third is compression ratio. You see, there are many areas in which compression can be useful, not just the desktop.

  29. #29
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    No, the usual matches and literals would be stored in the archive.
    The x86 code is too redundant for that anyway.
    But with code generation, the decoding process can be split
    into two independent passes - code generation and data generation.
    Which can be more efficient than plain data generation in
    some circumstances.
    So how is the generated code specific to the data?

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,368
    Thanks
    213
    Thanked 1,020 Times in 541 Posts
    Damn.

    Suppose that we have some simplified LZ code:

    Code:
    struct {
      word offset;
      byte length;
      byte literal;
    } lzdata[n_lzrecords];
    Then, there's the common way to decode it - the way
    most LZ decoders work - parsing and interpreting
    the "instructions" at once:

    Code:
    byte* p = output_buffer;
    for( i=0; i<n_lzrecords; i++ ) {
      // note: memcpy returns a pointer, not a byte count, and the match
      // source may overlap the destination - so copy byte-by-byte
      for( int k=0; k<lzdata[i].length; k++ ) p[k] = p[k-lzdata[i].offset];
      p += lzdata[i].length;
      *p++ = lzdata[i].literal;
    }
    But it can also be implemented in a different way,
    with 2 passes - parsing/compiling and actual execution:

    Code:
    // code generation
    byte* q = code_buffer;
    // cdecl passes the argument on the stack, so load it from there
    *q++ = 0x8B; *q++ = 0x7C; *q++ = 0x24; *q++ = 0x04; // mov edi,[esp+4]
    *q++ = 0x33; *q++ = 0xC9; // xor ecx,ecx
    for( i=0; i<n_lzrecords; i++ ) {
      // lea esi,[edi-offs]
      *q++ = 0x8D; *q++ = 0xB7; (uint&)*q = (uint)-lzdata[i].offset; q += 4;
      *q++ = 0xB1; *q++ = lzdata[i].length; // mov cl,length
      *q++ = 0xF3; *q++ = 0xA4; // rep movsb
      *q++ = 0xB0; *q++ = lzdata[i].literal; // mov al,literal
      *q++ = 0xAA; // stosb
    }
    *q++ = 0xC3; // ret (a real version would also preserve esi/edi)
    // uncompressed data generation (code_buffer must be executable memory)
    ((void(*)(byte*))(void*)code_buffer)( output_buffer );
    The idea is that there could be a speed improvement, because
    this way there's no inner loop in the parsing, so the C++ compiler
    can generate more efficient code for it, and also there are only
    2 data locations involved at each stage, instead of 3 in a normal
    LZ decoder, so the L1 data cache can be used more efficiently.

    Also, while it may be a bit hard to observe a real speedup from
    this trick in a single-threaded decoder (the requirements
    would be something like lzdata size > 1M and uncompressed data
    size > 100M), it certainly makes some sense on a
    multicore system, as it allows running the parsing and the
    uncompressed data generation in parallel.
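    For comparison, here's a portable sketch of the same two-pass split without x86 code generation: pass 1 "compiles" serialized records (the same offset/length/literal layout as the lzdata example above, with a hypothetical 4-bytes-per-record serialization) into a flat op array, and pass 2 executes them. Everything beyond the record layout is assumed for illustration:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* one LZ "instruction": copy `length` bytes from `offset` back,
       then append one literal */
    typedef struct { size_t offset, length; unsigned char literal; } Op;

    /* pass 1 - "compile": parse serialized records (4 bytes each:
       offset lo, offset hi, length, literal) into a flat op array */
    static size_t compile_ops(const unsigned char *in, size_t n_bytes, Op *ops)
    {
        size_t n = 0, i;
        for (i = 0; i + 4 <= n_bytes; i += 4, n++) {
            ops[n].offset  = in[i] | ((size_t)in[i + 1] << 8);
            ops[n].length  = in[i + 2];
            ops[n].literal = in[i + 3];
        }
        return n;
    }

    /* pass 2 - "execute": generate the uncompressed data; on a multicore
       machine this pass could run while pass 1 compiles ahead */
    static size_t run_ops(const Op *ops, size_t n, unsigned char *out)
    {
        size_t op = 0, i, k;
        for (i = 0; i < n; i++) {
            for (k = 0; k < ops[i].length; k++) {  /* overlap-safe copy */
                out[op] = out[op - ops[i].offset];
                op++;
            }
            out[op++] = ops[i].literal;
        }
        return op;
    }

    int main(void)
    {
        /* record 1: no match, literal 'a'; record 2: offset 1, length 3,
           literal 'b' -> "a" + "aaa" + "b" = "aaaab" */
        const unsigned char packed[] = { 0,0,0,'a', 1,0,3,'b' };
        Op ops[2];
        unsigned char out[16];
        size_t n_ops = compile_ops(packed, sizeof packed, ops);
        size_t n_out = run_ops(ops, n_ops, out);
        assert(n_ops == 2 && n_out == 5);
        assert(out[0] == 'a' && out[1] == 'a' && out[4] == 'b');
        return 0;
    }
    ```

    This keeps the two-data-locations-per-pass property without needing executable memory, at the cost of the op array being bulkier than generated machine code.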

