
Thread: another (too) fast compressor

  1. #61
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    While trying to solve a long-standing request raised by Jethro, I've run into an unexpected complication.

    The objective is to allow Directory compression with LZ4.
    The basic idea is this:
    since LZ4 can now deal with pipe mode (both input and output), an easy way to provide directory compression would be to chain tar with LZ4. This is a pretty obvious solution for Linux, where *.tar.gz files are the norm.
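
    For instance, the chain would look something along these lines (file names are just examples, using the "stdin" keyword convention shown further down in this thread):
    Code:
    tar cvf - somedir | lz4 stdin somedir.tar.lz4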

    However, for Windows, tar is not a default OS utility.
    The only implementation I have found so far is GNU tar, or variations of it.
    Unfortunately, the cost of this version is incredibly high: the binary itself is already pretty large (150KB minimum, depending on version), but what makes the budget ridiculously high are the dependencies on external DLLs. The MSYS version of tar, for example, costs more than 2MB, and that is without any help/man pages.

    Since I could not find anything equivalent to ezGz's libtar.pas source code (which is fine, but written in Pascal), I'm looking for something I can use: either a small C source code, or a fully built binary.

    The objective is to get a lighter equivalent of GNU tar. It does not even need to respect the TAR format. I just want it to pack all files in a directory (recursing into any sub-directories) using whatever (uncompressed) format it likes, and obviously to be able to do the reverse operation (otherwise, COPY would be just enough...).
    Moreover, I want this utility to support pipe mode, for both input and output, so it can be chained with further processing.
    Lastly, the utility shall be "small". It doesn't need to be "very small" though; even 100K will do. But definitely not multiple MB.

    Does anyone know of anything of that kind?

  2. #62
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Have you tried static linking and optimizations for size?

  3. #63
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    You mean compiling GNU tar myself?
    I guess that's an almost impossible task.

  4. #64
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    1. There's a small portable tar in e.g.
    http://downloads.sourceforge.net/pro...t/UnxUtils.zip
    http://downloads.sourceforge.net/pro...in32-0.6.3.exe
    (<120k w/o weird dependencies)
    But I think it's a bad idea, simply because its headers are too large.

    2. You can try using Sami's "archiver template" - http://compressionratings.com/d_archiver_template.html
    Also see http://compressionratings.com/s_scan.html

    Unfortunately directory traversal is far from simple.

    Also, I have my own tar equivalent, and it technically can be used with pipes,
    but it stores the whole archive index at the end of the archive (like 7z), so extracting files from such an archive
    via a pipe would be troublesome (it would require tempfiles or something).

  5. #65
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks Shelwien.
    The version you linked within UnxUtils is exactly what I'm looking for.
    It almost works...

    Unfortunately, the tar.exe program seems to have problems with directory names starting with a "t".
    As a consequence, \t is interpreted as a tab character, resulting in a failure to find the directory.
    Quite a pity; it looks like a stupid but nonetheless critical bug.
    I managed to get around it thanks to a batch manipulation which translates every \ in a path into /, as sketched below. Nothing is too easy in this world...
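
    A minimal sketch of that kind of batch substitution (the variable name is just an example), relying on cmd's %VAR:old=new% replacement:
    Code:
    @echo off
    rem Translate backslashes into forward slashes so tar.exe never sees a "\t" sequence
    set "TARGET=%~1"
    set "TARGET=%TARGET:\=/%"
    tar.exe cvf - "%TARGET%"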

    I still have to do further testing, but I hope this solves the issue and makes a release possible.

    Note: I know that tar is not optimal; reading its specs is instructive and shows inefficiencies (which are nonetheless easily squashed by further compression).
    http://en.wikipedia.org/wiki/Tar_(file_format)
    However, my objective is to find a quick fix for the current situation, leaving improvements for later.
    Your scan implementation is very good, but it requires some work to be integrated; it cannot be "plugged in" as easily as a batch file can pipe two processes.

  6. #66
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    91
    Thanks
    8
    Thanked 4 Times in 4 Posts
    Thanks for sticking with it. As a bonus, archive functionality opens the possibility of putting multiple (small) files into one block, which should improve the compression ratio for free in the case of tiny files with similar contents (i.e. the same extension).

  7. #67
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,423
    Thanks
    223
    Thanked 1,052 Times in 565 Posts
    I can try making a tar replacement, with interleaved filenames for pipe support.
    But it would be Windows-only and won't preserve file times, attributes and such.
    (Restoring attributes is tricky - e.g. you can only set file attrs after completely unpacking the file,
    and dir attrs after unpacking all files and dirs within it - regardless of the order of files in the archive; see the sketch below.)
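
    For illustration, here is a minimal sketch (with hypothetical helper names, not actual archiver code) of that ordering constraint: file attributes are applied right after each file is fully written, while directory attributes are collected and only applied once everything inside them has been extracted.
    Code:
    #include <windows.h>
    #include <string.h>

    /* Hypothetical record of a directory whose attributes must be restored later. */
    typedef struct { char path[MAX_PATH]; DWORD attrs; } DirEntry;
    static DirEntry pending_dirs[1024];
    static int pending_count = 0;

    /* Called right after a file's data has been fully written. */
    static void restore_file_attrs(const char *path, DWORD attrs)
    {
        SetFileAttributesA(path, attrs);   /* safe now: the file won't be written to again */
    }

    /* Called when a directory entry is read from the archive: defer its attributes. */
    static void defer_dir_attrs(const char *path, DWORD attrs)
    {
        if (pending_count < 1024) {
            strncpy(pending_dirs[pending_count].path, path, MAX_PATH - 1);
            pending_dirs[pending_count].path[MAX_PATH - 1] = '\0';
            pending_dirs[pending_count].attrs = attrs;
            pending_count++;
        }
    }

    /* Called once, after all files and directories have been extracted. */
    static void restore_dir_attrs(void)
    {
        int i;
        for (i = 0; i < pending_count; i++)
            SetFileAttributesA(pending_dirs[i].path, pending_dirs[i].attrs);
    }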

    However, first I have to know that you're willing to use it :)

  8. #68
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks very much Shelwien.
    Don't bother just yet; I've managed to get around the limitation: instead of calling the tar program directly, I'm calling a batch file which does the necessary adaptation. No more \t problem.

    This version seems to work, so I've uploaded it to the LZ4 homepage:
    http://fastcompression.blogspot.com/p/lz4.html

    There is now a Windows Installer, which integrates LZ4 into the context menu.

    It works by right-clicking on a file or a directory: the option "compress with LZ4" will be proposed. Selecting it has the same effect as the drag'n'drop support for files.
    For a directory, the behavior looks exactly the same; it's just that the batch file transparently calls Shelwien's tar.exe program and pipes the result into LZ4.

    For decoding, just double-click on the compressed file.

    Well, these are little things, but the program is nonetheless far more agreeable to use this way.

    Feel free to comment

  9. #69
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Mmmh, it seems I've been speaking too fast.
    The tests were fine yesterday with Windows 7, but are no longer passing with Windows XP this morning.
    It seems there is an issue with pipes.
    It's unclear which one, since pipe mode works fine with tar.exe alone
    (e.g. tar.exe cvf - dir > dest.tar)
    and works fine with lz4.exe alone
    (e.g. lz4.exe stdin dest.lz4 < file),
    but the combination of both seems problematic under XP.
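
    For reference, the failing combination is something along these lines (the destination name is just an example):
    Code:
    tar.exe cvf - dir | lz4.exe stdin dest.tar.lz4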

  10. #70
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    91
    Thanks
    8
    Thanked 4 Times in 4 Posts

    Cool

    The context menu is even better than drag&drop! Works great (W7). Is there a way to change the default compression level (-c0), btw?

    A bug/limitation - there is a problem with Unicode(?) characters:
    Code:
    Compressing stdin using 2 threads (compression level = 0)
    F:\Programy Visty\LZ4\tar.exe: Cannot add file Reps/child/Child/+?????_P vs +???
    +?_Z_1.rep: No such file or directory
    F:\Programy Visty\LZ4\tar.exe: Cannot add file Reps/child/Child/+?????_P vs +???
    +?_Z_2.rep: No such file or directory
    F:\Programy Visty\LZ4\tar.exe: Cannot add file Reps/child/Child/+?????_P vs +???
    +?_Z_3.rep: No such file or directory
    F:\Programy Visty\LZ4\tar.exe: Cannot add file Reps/child/Child/??u???_T vs +???
    +?_Z_1.rep: No such file or directory
    Visty\LZ4\tar.exe: Error exit delayed from previous errors
    Compression completed : 157.7MB --> 149.3MB  (94.65%) (156532783 Bytes)
    Total Time : 8.31s ==> 19.9MB/s
    (CPU : 0.48s = 6%)
    Also, although I suggested tar myself, I feel that tar's temporary file creation is a bit inefficient, because HDDs are still much slower than LZ4.
    Last edited by jethro; 28th October 2011 at 17:23.

  11. #71
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Hi
    There is currently no easy way to change the default compression level (it would require changing some registry entries),
    but that's sure to change soon. It's on the priority list.

    Currently, the main issue to solve is with tar.
    If tar does not support Unicode, then something else (or another version of tar) will have to be found.
    And on top of this, the pipe(?) issue under Windows XP also needs to be investigated and solved.

    @jethro: I'm interested in having a small sample directory with file names using non-Roman Unicode characters. Would you mind helping to provide one?

    Regards

  12. #72
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Cyrillic name:
    Attached Files
    • File Type: 7z a.7z (120 Bytes, 127 views)

  13. #73
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Tar itself supports UTF in the latest version of the format (POSIX.1-2001 pax); your implementation may not.
    Anyway, I find tar to be a very ugly way of supporting directory trees: interleaving data and metadata in a single stream hurts compression.
    I know it's a slight issue for LZ4, but I would seek alternatives anyway.

  14. #74
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks for the test sample Bulat.
    Indeed, the tar version shipped in the current LZ4 installer is not able to deal with non-Roman characters, so that's a dead end. Another version will be necessary, or another concatenation utility.
    It's probably about time to accept Shelwien's offer.

  15. #75
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    It's slightly off-topic, but since LZ4 was first announced on this forum, here is an interesting follow-up:

    http://fastcompression.blogspot.com/...mapreduce.html

    LZ4 has been integrated into Hadoop, as one of the fast compression algorithm alternatives for the Apache big data project.

    https://issues.apache.org/jira/brows...mment-13171869

    It's pretty heart-warming news, since even open-source projects don't necessarily become useful after release.
    Last edited by Cyan; 23rd December 2011 at 01:59.

  16. #76
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    Congratulations!

    Please credit me for secondary promotions, if it's used

  17. #77
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Congratulations!

  18. #78
    Member
    Join Date
    Jun 2009
    Location
    Kraków, Poland
    Posts
    1,475
    Thanks
    26
    Thanked 121 Times in 95 Posts
    LZ4 will be incorporated into the Linux kernel: http://www.phoronix.com/scan.php?pag...tem&px=MTA1OTQ

  19. #79
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Cuuute! btrfs is the next-gen standard Linux FS, so it will remain there forever.

  20. #80
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks for the notification!
    So here is the explanation for the big spike in website hits yesterday...

    one word : wow

    I would certainly not have bet a dime on this course of events when the project was first open-sourced.

  21. #81
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Congratulations.

    BTW, it looks like the C port of Snappy is a significantly different algorithm... it deserves benchmarking. If only I had more time...

  22. #82
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    For interested users : LZ4 v1.3 is out => http://fastcompression.blogspot.com/p/lz4.html

    Its most visible improvement is a much faster high compression (-c2) mode, which nonetheless compresses (a bit) more.

    The I/O code is also significantly faster, but only benchmarks are likely to notice it,
    since it only matters with very fast drives, such as a RAM drive or an array of SSDs.
    Last edited by Cyan; 17th March 2012 at 15:19.

  23. #83
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    LZ4hc is now part of the LZ4 project and has moved to the BSD license:
    https://code.google.com/p/lz4/source/detail?r=66
    Thanks, Yann.

  24. #84
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    Yann, how about using a 7+1 encoding for literal/match lengths, e.g. replacing

    for(; len > 254 ; len-=255) *op++ = 255; *op++ = (BYTE)len;

    with

    for(; unlikely(len > 127); len/=128) *op++ = 128+(BYTE)(len%128); *op++ = (BYTE)len;
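
    For comparison, a small standalone sketch of both encodings (function names and the test value are mine, not from the LZ4 source), showing how many bytes each one needs for a long length:
    Code:
    #include <stdio.h>

    typedef unsigned char BYTE;

    /* current LZ4-style run encoding: one 255 byte per 255 units, cost ~ len/255 */
    static int encode_255run(BYTE *op, unsigned len)
    {
        BYTE *start = op;
        for (; len > 254; len -= 255) *op++ = 255;
        *op++ = (BYTE)len;
        return (int)(op - start);
    }

    /* proposed 7+1 encoding: 7 payload bits per byte, top bit marks continuation, cost ~ log128(len) */
    static int encode_7plus1(BYTE *op, unsigned len)
    {
        BYTE *start = op;
        for (; len > 127; len /= 128) *op++ = (BYTE)(128 + len % 128);
        *op++ = (BYTE)len;
        return (int)(op - start);
    }

    int main(void)
    {
        BYTE buf[8192];
        unsigned len = 1000000;  /* e.g. a ~1 MB literal run or long match */
        printf("255-run encoding: %d bytes\n", encode_255run(buf, len));  /* 3922 bytes, ~0.39% */
        printf("7+1 encoding    : %d bytes\n", encode_7plus1(buf, len));  /* 3 bytes */
        return 0;
    }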

  25. #85
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Hi Bulat

    Yes, it would probably be more efficient, and also probably a bit slower.
    One has to test it, measure the trade-off, and get an idea of what's better.

    To be fair, there are also other things that I would do differently "now".
    But now that LZ4 is out, and a sizable user base is integrating it into many applications,
    I don't want to introduce a change which would break compatibility.

    Forking LZ4 is always possible though.

  26. #86
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Cyan View Post
    Forking LZ4 is always possible though.
    LZ5?

  27. #87
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Bulat, I made a quick test of your idea with LZ4 r85.
    A test with 4 working threads and test files divided into 128K chunks (my standard procedure):
    Code:
    Modified:
    scc.tar 102945694 1410 MB/s
    HANNOMB.ttf 22741828
    nbbs.tar 11746863
    calgary.tar 1639230
    scc.tar (not split into chunks) 101597296
    
    Original:
    scc.tar 102943473 1404 MB/s
    HANNOMB.ttf 22762712
    nbbs.tar 11754726
    calgary.tar 1639214
    scc.tar(not split into chunks) 101594941
    Despite being slightly weaker on scc, it seems to be an improvement.

    To Yann:
    I noticed that in the main loop the encoding is unrolled, but for the last literals it's not. Is that intentional?
    For the record, I didn't unroll Bulat's code.
    Last edited by m^2; 16th December 2012 at 22:40.

  28. #88
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    m^2, the 7+1 encoding significantly improves the following corner cases:
    1. literal length encoding: currently, incompressible data is inflated by about 0.4% (one extension byte per 255 bytes of literals, i.e. ~1/255); with 7+1 encoding it would be inflated by just a few bytes
    2. match length encoding: long repetitions of the same byte or of a few bytes, such as areas filled with zeroes, currently compress to about 0.4% of the original size; with 7+1 they would compress to just a few bytes

    In FreeArc, I will keep the standard LZ4 code.

  29. #89
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    I noticed that in the main loop, encoding is unrolled, but in last literals it's not. Is it intentional?
    The last-literals code is executed just once per block, so there is no need to unroll it.

  30. #90
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Bulat said it all

