
Thread: CSArc (CSC) 3.3, an LZ77 compressor again

  1. #1
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts

    CSArc (CSC) 3.3, an LZ77 compressor again

    Hi All

    More than 4 years ago, I worked out CSC 3.2 Final, with ideas and code mixed in from several other compressors.

    Recently Stephan Busch gave me great encouragement to refine the work. I did, since I also felt the previous code was not
    good and my programming has more or less improved.

    The result is libcsc / csarc 3.3. Most parts and ideas are kept the same as in CSC 3.2, but 80+% of the code was rewritten.

    Here is the specification of libcsc:
    * The whole compressor is mostly inspired by LZMA, with some ideas from others and some of my own mixed in.
    * Based on an LZ77 sliding window + range coder. Literals are coded with pure order-1; the literal/match flag is coded using the state of the previous 3 flags.
    Match length and distance are coded into different slots. All of this is quite similar to LZMA.
    * The LZ77 match finder uses a 2-byte hash table and a 3-byte hash table, each with only one candidate. For -m1 .. -m4, an additional 6-byte
    hash table is used, with widths (candidates) of 1, 8, 2, and 8 respectively.
    For -m5, there is a binary tree with a 6-byte hash search head. The binary-tree code is from LZMA. The binary-tree match finder
    does not cover the whole dictionary range; it is smaller, to avoid 10x dictionary-size memory usage. The search-head table for the BT
    has no such limitation.
    * Two kinds of parser: a lazy one and an advanced one. The advanced parser computes the lowest-price path over each chunk of several KB.
    -m1 and -m2 use the lazy parser, -m3 .. -m5 the advanced one. Details of the m1 .. m5 configurations can be found in csc_enc.cpp.
    * A simple analyzer runs on every 8KB of incoming raw data. It computes the entropy of the current block and the byte similarity at strides of 1, 2, 3, 4, and 8
    to detect potential data tables, EXE code, or English text.
    * Continuous blocks analyzed as the same type are compressed as one chunk; based on the type, the chunk may go through:
    a. The E8/E9 preprocessor (Shelwien gave me the code some time ago).
    b. A very simple & fast English preprocessor that just replaces some words with chars 128 to 255. The dictionary is from Sami Runsas.
    c. A delta preprocessor on the corresponding channel, followed by order-1 direct coding instead of LZ77. The entropy of the delta-processed block
    is also calculated and compared first, to avoid false positives.
    d. Store: high-entropy data is kept uncompressed.
    e. For c & d, the original data is still placed into the LZ77 window and match finder. Before a block is compressed, it is checked for
    duplicates, so two identical blocks in one file are still compressed into one.
    * The libcsc API is now very LZMA-like, using the same Types.h as LZMA.
    * Memory usage is about 2 - 4x the dictionary size.
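    As a rough illustration of the analyzer step above, here is a minimal sketch (my own code, not libcsc's actual implementation) of computing the order-0 entropy of one 8KB block; libcsc additionally measures byte similarity at several strides, which is not shown:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Order-0 entropy in bits per byte of one analyzer block.
// High entropy suggests already-compressed data (store it raw);
// low entropy suggests text/tables worth a preprocessor.
double block_entropy(const std::vector<uint8_t>& block) {
    if (block.empty()) return 0.0;
    size_t freq[256] = {0};
    for (uint8_t b : block) freq[b]++;
    double bits = 0.0;
    for (size_t f : freq) {
        if (f == 0) continue;
        double p = static_cast<double>(f) / block.size();
        bits -= p * std::log2(p);
    }
    return bits;
}
```

    A chunk whose entropy is near 8 bits/byte would then be stored, per point d above.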

    For csarc:
    * The cross-platform thread / file / directory operation code is from ZPAQ.
    * Files are sorted by extension and size (unless a file is very small), then split into tasks by extension.
    * Each task is compressed by one thread, and several threads can work on multiple tasks at once.
    * Each worker task has its own I/O thread and buffer.
    * Multi-threading works the same way when extracting.
    * For a single big file, the -p# switch can force-split the file into multiple tasks for multi-threading, though this hurts compression.
    * Memory usage is multiplied by the number of threads.
    * Adler32 checksums are used.
    * The meta info of all files is also compressed by libcsc and appended after all other compressed data.
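    The sort-and-split idea can be sketched like this (a hypothetical illustration; the names and the exact grouping rule are mine, not csarc's real code):

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct FileEntry { std::string path; long long size; };

// Sort by (extension, size), then emit one task per extension group,
// so each worker thread gets a solid run of similar files.
std::vector<std::vector<FileEntry>>
group_by_extension(std::vector<FileEntry> files) {
    auto ext = [](const std::string& p) {
        size_t dot = p.find_last_of('.');
        return dot == std::string::npos ? std::string() : p.substr(dot);
    };
    std::sort(files.begin(), files.end(),
              [&](const FileEntry& a, const FileEntry& b) {
                  if (ext(a.path) != ext(b.path)) return ext(a.path) < ext(b.path);
                  return a.size < b.size;
              });
    std::vector<std::vector<FileEntry>> tasks;
    for (const auto& f : files) {
        if (tasks.empty() || ext(tasks.back().back().path) != ext(f.path))
            tasks.emplace_back();
        tasks.back().push_back(f);
    }
    return tasks;
}
```

    With this scheme the thread count only helps when there are several extension groups, which matches the -p# caveat above for single big files.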

    Usage of CSArc:
    Create a new archive:
    csarc a [options] archive_name file1 file2 ...
    [options] can be:
    -m[1..5] Compression level, from fastest to strongest
    -d##[k|m] Dictionary size, must be in the range [32KB, 1GB]
    -r Recursively add files in directories
    -f Forcibly overwrite an existing archive
    -p## Only works with single-file compression:
    splits the file for multi-threading
    -t# Number of threads, in the range [1,8]
    Memory usage will be multiplied by this number

    Extract file(s) from archive:
    csarc x [options] archive_name [file1_in_arc file2_in_arc ...]
    [options] can be:
    -t# Number of threads, in the range [1,8]
    Memory usage will be multiplied by this number
    -o out_dir Extraction output directory

    List file(s) in archive:
    csarc l [options] archive_name [file1_in_arc file2_in_arc ...]
    [options] can be:
    -v Shows fragment information with Adler32 hash

    Test to extract file(s) in archive:
    csarc t [options] archive_name [file1_in_arc file2_in_arc ...]
    [options] can be:
    -t# Number of threads, in the range [1,8]

    Example:
    csarc a -m2 -d64m -r -t2 out.csa /disk2/*
    csarc x -t2 -o /tmp/ out.csa *.jpg
    csarc l out.csa
    csarc t out.csa *.dll



    Codes:
    https://github.com/fusiyuan2010/CSC

    Attached are a Windows x86 exe and a Linux x64 exe, with source.

    Originally developed under Linux; I compiled the Windows binary with VC++ 2010 Express.
    So under Windows, memory usage may be limited. I was not able to compile an x64 version under Windows; I would appreciate it if somebody could make a 64-bit build.

    The code is in the public domain. I used some code from ZPAQ; I don't know if that is OK, since ZPAQ is GPL licensed.

    Bug reports are welcome.

    Update on 2015.03.26:
    1. Compiled an x64 Windows binary.
    2. Fixed a bug where files with multi-byte character names could not be opened.
    3. Fixed a .c_str() compilation error in csa_file.h.
    4. Fixed a crash on files > 4GB in the 32-bit build.
    5. Fixed a progress indicator bug for files > 4GB.
    6. Fixed a small typo, and the overwrite warning that was always shown.
    Attached Files
    Last edited by Fu Siyuan; 27th March 2015 at 04:25. Reason: Fixed some of bugs.

  2. The Following 15 Users Say Thank You to Fu Siyuan For This Useful Post:

    avitar (21st March 2015),Bulat Ziganshin (21st March 2015),cade (25th March 2015),Cyan (22nd March 2015),Gonzalo (22nd March 2015),Mat Chartier (21st March 2015),Matt Mahoney (22nd March 2015),Mike (21st March 2015),milky (10th June 2017),m^2 (21st March 2015),Nania Francesco (21st March 2015),ne0n (21st March 2015),nemequ (21st March 2015),Stephan Busch (21st March 2015),Surfer (21st March 2015)

  3. #2
    Member
    Join Date
    Jun 2013
    Location
    Canada
    Posts
    36
    Thanks
    24
    Thanked 47 Times in 14 Posts
    Very nice, enwik8 to 24,593,703 in 59.5s and decomp in 2.9s (-m5 -d64).

    I'm working on some dynamic dictionary stuff for mcm v0.83 and found that the following approach works pretty well (in case you're interested in improving your dictionary preprocessing):
    use capital conversion for first char / whole words
    count the frequencies of all words with 3 <= len < xx
    filter out all the words which don't occur at least ~5-10 times
    sort in descending order by (word.length - 1) * word.freq
    pick the top 20 words for 1-byte codes (starting at byte 128)
    pick the next top (128-20) * 128 words for 2-byte codes
    after assigning 1-byte codes, sort the words that will get 2-byte codes lexicographically
    assign 2-byte codes (using bytes 148+, 128+) to the lexicographically sorted words
    Using this approach, you should be able to get enwik8 to ~23.6m; dictionary construction takes around 8s for me.
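    The scoring and filtering steps above could look roughly like this (an illustrative sketch under the stated thresholds; the names are mine, and the capital conversion and 2-byte-code assignment are omitted):

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Pick the words whose replacement saves the most bytes:
// a 1-byte code saves roughly (len - 1) bytes per occurrence.
std::vector<std::string>
pick_one_byte_words(const std::map<std::string, long>& word_freq,
                    long min_freq = 5, std::size_t top_n = 20) {
    std::vector<std::pair<long, std::string>> scored;
    for (const auto& [word, freq] : word_freq) {
        if (word.size() < 3 || freq < min_freq) continue;  // drop short/rare words
        scored.push_back({static_cast<long>(word.size() - 1) * freq, word});
    }
    // descending by estimated savings
    std::sort(scored.begin(), scored.end(),
              [](const auto& a, const auto& b) { return a.first > b.first; });
    if (scored.size() > top_n) scored.resize(top_n);
    std::vector<std::string> dict;
    for (const auto& s : scored) dict.push_back(s.second);
    return dict;
}
```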

  4. #3
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    Simply great work! I am very happy about your return!

  5. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    733
    Thanked 659 Times in 354 Posts
    afair zpaq changed the license to BSD since Matt left Dell. also, can you use github for hosting your code?
    Last edited by Bulat Ziganshin; 21st March 2015 at 12:54.

  6. #5
    Tester
    Nania Francesco's Avatar
    Join Date
    May 2008
    Location
    Italy
    Posts
    1,565
    Thanks
    220
    Thanked 146 Times in 83 Posts
    In the WCC benchmark 2015 (with an Intel Core i7 920 2.66GHz, 6GB RAM),
    CSARC 3.3 fails (the program crashes) with the options "a -r -f -m5 -d512m -t4".
    Why not enable solid mode?
    Last edited by Nania Francesco; 21st March 2015 at 14:09.

  7. The Following User Says Thank You to Nania Francesco For This Useful Post:

    Fu Siyuan (27th March 2015)

  8. #6
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    afair zpaq changed the license to BSD since Matt left Dell. also, can you use github for hosting your code?
    Libzpaq is public domain.
    IIRC the frontend is GPL.

  9. #7
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    afair zpaq changed the license to BSD since Matt left Dell. also, can you use github for hosting your code?
    Quote Originally Posted by m^2 View Post
    Libzpaq is public domain.
    IIRC the frontend is GPL.
    libzpaq has always (or at least for a long time) been public domain. As of 7.01 the frontend is also public domain—see http://encode.su/threads/456-zpaq-up...ll=1#post42577

    Quote Originally Posted by Fu Siyuan View Post
    The code is in Public Domain. I used some codes in ZPAQ, I don't know if it is good since ZPAQ is GPL licensed.
    What version of ZPAQ did you copy from? If it was pre-7.01, would it be difficult to replace it with code from the current ZPAQ?

  10. The Following User Says Thank You to nemequ For This Useful Post:

    m^2 (21st March 2015)

  11. #8
    Member
    Join Date
    Jun 2013
    Location
    Sweden
    Posts
    150
    Thanks
    9
    Thanked 25 Times in 23 Posts
    Test of DVD-movie from 1TB HDD <-> RamDisk Z:

    BSC310 compressed to 4188961022 in 6 minutes (-b768e2rsp).

    Z:\csarc3.3\csarc_src>csc.exe c -m5 -d550m e:\x\A3622906.iso iso.csc
    Estimated memory usage: 1678 MB
    4682022912 -> 4406882291
    Compression in 28 minutes 19 seconds

    Z:\csarc3.3\csarc_src>csc d iso.csc e:\x\iso.iso
    4406877522 -> 4682022912
    Decompression in 4 minutes 17 seconds

    CRC32 verified and matched.

    I was unable to use 600MB (the program crashed), so I used 550MB. The csc.exe was built by me; I tried to compile with MinGW 4.9.2, with a lot of errors.

    I only found csarc_win_x86.exe after my test.

    Z:\csarc3.3>csarc_win_x86.exe a -m1 -d63m -t4 -f -p4 iso.csc e:\x\A3622906.iso
    Compressed Size: 4430857100 in 8 minutes 10 seconds

    Z:\csarc3.3>csarc_win_x86.exe x -t4 -o e:\iso.iso iso.csc
    Decompression in 1 minutes 40 seconds

    Oh! -o is the "Extraction output directory", and I got the file in E:\iso.iso\e\x. Who cares? CRC32 verified and matched.

    7z920 a -m0=lzma2:d28 -mmt=8 z:\iso *.iso gave 4413291818 in 5 minutes 22 seconds.
    Last edited by a902cd23; 21st March 2015 at 23:18. Reason: add 7z

  12. The Following User Says Thank You to a902cd23 For This Useful Post:

    Fu Siyuan (27th March 2015)

  13. #9
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    If a file path has multi-byte characters, the file can't be opened.

  14. The Following User Says Thank You to xezz For This Useful Post:

    Fu Siyuan (27th March 2015)

  15. #10
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    I'm seriously disappointed by the fact that a lot of respectable members of the Encode community expressed their gratitude to the topic starter.

    1. The sources are not compilable. I wonder how the author managed to create the binaries, because there is an error in csa_file.h which causes the following message:
    Code:
    cannot pass objects of non-trivially-copyable type
    Though it's fixable with c_str().

    2. Clumsy multithreading implementation. I tried to archive a folder (8592 files, 3.27GB) using csarc.exe a -f -r -m5 -d1m -t8 TEST.cs d:\MingW\IC\*
    At the beginning CSArc really utilizes all of the cores, but the number of utilized cores quickly drops to 1, and this behaviour results in the following stats:
    Code:
    Process Time     :         754.578s
    Clock Time       :         415.765s
    Is it normal scalability for 8 threads? No.

    3. There is no support for big files. I don't know what the limit is, but when I tried to compress a 9.69GB file with -m1 -d1m -t1, CSArc just crashed at the end.
    The interesting thing is that the whole file had been read (IO Read: 10168226 KB) but the console progress bar reported only 17%, which simply means it is buggy too.
    The output file has no header, and you can't extract even a byte from it.

    4. CSArc is NOT LOSSLESS.
    Code:
    C:\ARC\CSC\csarc3.3\32 csarc.exe x -o g:\TEMP\OUT TEST.cs
    CSArc 3.3, experimential archiver by Siyuan Fu
     (https://github.com/fusiyuan2010)
    [==>                                               ]   6% done******** d:\MingW\IC\Compiler/11.1/072/ipp/em64t/lib/ippjmergedem64t.lib extraction/verify failed
    [=====>                                            ]  13% done******** d:\MingW\IC\Compiler/11.1/072/ipp/em64t/lib/ippvmmergedem64t.lib extraction/verify failed
    [==========>                                       ]  23% done******** d:\MingW\IC\Compiler/11.1/072/ipp/ia32/lib/ippsmerged_t.lib extraction/verify failed
    [=============>                                    ]  29% done******** d:\MingW\IC\Compiler/11.1/072/ipp/ia32/lib/ippimerged.lib extraction/verify failed
    [==============>                                   ]  30% done******** d:\MingW\IC\Compiler/11.1/072/ipp/ia32/lib/ippmmerged_t.lib extraction/verify failed
    [================>                                 ]  34% done******** d:\MingW\IC\Compiler/11.1/072/ipp/em64t/lib/ippgenmergedem64t.lib extraction/verify failed
    [==================================================]  100% done
    
    Process ID       : 3140
    Thread ID        : 3692
    Process Exit Code: 0
    Thread Exit Code : 0
    Should I comment it?
    Also please note the exit code.

    Quote Originally Posted by Fu Siyuan View Post
    and my programming more or less improved.
    An impressive joke, I must say. Self-irony is a good skill. Anyway, I'm sorry to tell you that you didn't make a single step forward.
    Also, about x64-win: it's easy to create, but I think there is no point in publishing builds of junk software.

  16. The Following User Says Thank You to Skymmer For This Useful Post:

    Fu Siyuan (27th March 2015)

  17. #11
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Skymmer View Post
    I'm seriously disappointed by the fact that a lot of respectable members of the Encode community expressed their gratitude to the topic starter.
    He released a pretty large body of code, which he obviously spent a lot of time on, as open source software. He didn't have to do that. While I would be happier if the code were perfect, he has already done more than enough to earn my gratitude.

    Furthermore, he seems to be encouraging bug reports ("Any bug is welcome to report."), which implies he is interested in fixing these issues. I don't have any problem with "release early, release often".

    Quote Originally Posted by Skymmer View Post
    1. The sources are not compilable. I wonder how the author was managed to create the binaries, just because there is an error in csa_file.h which causes the following message:
    Code:
    cannot pass objects of non-trivially-copyable type
    Though it's fixable with c_str.
    It works for me (GCC 4.9.2 on Fedora 21). Could you provide a bit more detail (maybe compiler, compiler version, and what exactly doesn't work)?

  18. The Following 4 Users Say Thank You to nemequ For This Useful Post:

    avitar (22nd March 2015),Cyan (23rd March 2015),Gonzalo (22nd March 2015),milky (10th June 2017)

  19. #12
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Quote Originally Posted by nemequ View Post
    He released a pretty large body of code, which he obviously spent a lot of time on, as open source software.
    The size of the code doesn't matter. But it seems you're judging the quality by size, and you think that 229 764 bytes of sources can be called "a pretty large body of code"...
    5 855 141 bytes of FreeArc sources are pretty large.
    6 038 452 bytes of 7z sources are pretty large.
    229 764 bytes is almost nothing.

    Quote Originally Posted by nemequ View Post
    It works for me (GCC 4.9.2 on Fedora 21). Could you provide a bit more detail (maybe compiler, compiler version, and what exactly doesn't work)?
    I meant Windows. More specifically, mingw-w64 i686-4.9.2-release-win32-dwarf-rt_v3-rev1.

  20. #13
    Member
    Join Date
    Jul 2013
    Location
    United States
    Posts
    194
    Thanks
    44
    Thanked 140 Times in 69 Posts
    Quote Originally Posted by Skymmer View Post
    The size of the code doesn't matter. But seems you're judging the quality by its size and you think that 229 764 bytes of sources can be named "a pretty large body of code"...
    No, I'm not judging the quality at all. I'm judging the amount of effort it took (hence the "which he obviously spent a lot of time on" bit) which, although nothing on the order of FreeArc or 7zip, was obviously non-trivial. I haven't looked at the code in much detail so I can't really comment on the quality, but regardless of what it is now it can improve over time, especially if people point out bugs or other areas for improvement.

    I think this really comes down to our respective feelings about release early, release often. In my opinion it's a perfectly valid way to do things, and I don't see anything wrong with being grateful for an early release even if it is buggy. Frankly I prefer RERO, since feedback will hopefully make the end result better than if it was developed behind closed doors.

    Quote Originally Posted by Skymmer View Post
    I meant Windows. More specifically, mingw-w64 i686-4.9.2-release-win32-dwarf-rt_v3-rev1.
    Interesting that the same version of GCC works for me but not for you. Anyway, I'm still not sure *what* exactly doesn't work; all you've said is that it is something in csa_file.h, with no line number, snippet, etc. I'm sure Fu Siyuan would appreciate that information, since he also seems to be on the "works for me" side of the issue.

  21. The Following 2 Users Say Thank You to nemequ For This Useful Post:

    avitar (22nd March 2015),Matt Mahoney (22nd March 2015)

  22. #14
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    681
    Thanks
    37
    Thanked 168 Times in 84 Posts
    Yes, you're right, I wasn't too clear in my report.

    csa_file.h ln:234
    Code:
    fprintf(stderr, "File open error %s\n", filename);
    On the Windows platform with mingw-w64 i686-4.9.2-release-win32-dwarf-rt_v3-rev1, you will get:
    Code:
    In file included from archiver\csarc.cpp:6:0:
    archiver/csa_file.h: In member function 'bool OutputFile::open(const char*)':
    archiver/csa_file.h:234:57: error: cannot pass objects of non-trivially-copyable type 'std::wstring {aka class std::basic_string<wchar_t>}' through '...'
             fprintf(stderr, "File open error %s\n", filename);
                                                             ^
    In file included from archiver\csa_file.cpp:1:0:
    archiver/csa_file.h: In member function 'bool OutputFile::open(const char*)':
    archiver/csa_file.h:234:57: error: cannot pass objects of non-trivially-copyable type 'std::wstring {aka class std::basic_string<wchar_t>}' through '...'
             fprintf(stderr, "File open error %s\n", filename);
                                                             ^
    In file included from archiver/csa_io.h:4:0,
                     from archiver/csa_worker.h:6,
                     from archiver/csa_progress.h:7,
                     from archiver\csa_progress.cpp:1:
    archiver/csa_file.h: In member function 'bool OutputFile::open(const char*)':
    archiver/csa_file.h:234:57: error: cannot pass objects of non-trivially-copyable type 'std::wstring {aka class std::basic_string<wchar_t>}' through '...'
             fprintf(stderr, "File open error %s\n", filename);
                                                             ^
    In file included from archiver/csa_io.h:4:0,
                     from archiver/csa_worker.h:6,
                     from archiver\csa_worker.cpp:1:
    archiver/csa_file.h: In member function 'bool OutputFile::open(const char*)':
    archiver/csa_file.h:234:57: error: cannot pass objects of non-trivially-copyable type 'std::wstring {aka class std::basic_string<wchar_t>}' through '...'
             fprintf(stderr, "File open error %s\n", filename);
    Supposed solution:
    csa_file.h ln:234
    Code:
    fprintf(stderr, "File open error %s\n", filename.c_str());

  23. The Following User Says Thank You to Skymmer For This Useful Post:

    Fu Siyuan (27th March 2015)

  24. #15
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Agree with nemequ - I think Skymmer was being rather harsh. Don't we want to get new people involved in compression, and new ideas too? It does mostly seem like quite an exclusive club on this forum. And if the coding isn't as good as that of some of the other club members, so long as they take constructive comments on board, surely that's OK? J (one of the thankers!)

  25. #16
    Tester
    Stephan Busch's Avatar
    Join Date
    May 2008
    Location
    Bremen, Germany
    Posts
    873
    Thanks
    460
    Thanked 175 Times in 85 Posts
    I encouraged Siyuan to keep working on CSC and not give it up.
    If you write new things, there can always be bugs, and that's OK with me.
    Trying out something new always requires courage,
    and my courage would be lessened after reading your lines, Skymmer.
    Please check whether you really wanted to write about quality or to make a useful suggestion.

  26. The Following 6 Users Say Thank You to Stephan Busch For This Useful Post:

    avitar (22nd March 2015),Bulat Ziganshin (22nd March 2015),Gonzalo (22nd March 2015),milky (10th June 2017),ne0n (22nd March 2015),SvenBent (23rd March 2015)

  27. #17
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    467
    Thanks
    202
    Thanked 81 Times in 61 Posts
    @skymmer: Are you OK, man???
    Did you read what you wrote?
    If you've found bugs, fix them, upload the code, and they will thank you too.
    Or write a decent compressor yourself.
    And let the man do his job. Go ahead, Fu.

    PS: I recall being sad when I saw the word "final" in 3.2...
    Last edited by Gonzalo; 22nd March 2015 at 19:55.

  28. The Following User Says Thank You to Gonzalo For This Useful Post:

    mhajicek (22nd March 2015)

  29. #18
    Member
    Join Date
    Aug 2014
    Location
    Overland Park, KS
    Posts
    17
    Thanks
    0
    Thanked 0 Times in 0 Posts
    CSArc compresses enwik9 to 213298276 bytes in approx. 500 seconds and decompresses in about 83 1/3 seconds on my Linux box; it compresses enwik8 to 24593696 bytes in approx. 50 seconds and decompresses in approx. 8 1/3 seconds.
    The options I used were -m5 -d64m -f.

  30. #19
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    10GB results (system 4): http://mattmahoney.net/dc/10gb.html (-m5 -d512 -t1 makes Pareto frontier for decompression speed).
    LTCB: http://mattmahoney.net/dc/text.html#2040 (moved up about 25 spots).
    Silesia: http://mattmahoney.net/dc/silesia.html (only tested max compression).

    All files decompressed OK. I tested by compiling from source in Ubuntu, g++ 4.8.2. On 10GB, file dates but not directory dates were restored. (This can happen when updating directory contents after setting the date). Also, it seems I needed -f to create a new archive.

    zpaq v7.01 and later are public domain except for libdivsufsort in libzpaq, which is MIT license (in case you want to use a suffix array to find matches). Personally I don't care if you used GPL code from earlier versions.

  31. #20
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    467
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Also, it seems I needed -f to create a new archive.
    On Windows too

  32. #21
    Member
    Join Date
    Sep 2011
    Location
    uk
    Posts
    237
    Thanks
    187
    Thanked 16 Times in 11 Posts
    Quote Originally Posted by Gonzalo View Post
    On Windows too
    IMO the -f seems ott - most compressors just overwrite. J

  33. #22
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    An archiver like zip, rar, 7zip, or freearc would normally update an existing archive by adding or replacing files. But not every archiver supports this. For example, pcompress, nanozip, and most PAQ versions are like csarc in that they cannot update an existing archive; they can only create new ones. (Of course, the ideal case is to keep both old and new versions, like obnam and zpaq.)

  34. #23
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    467
    Thanks
    202
    Thanked 81 Times in 61 Posts
    Quote Originally Posted by avitar View Post
    IMO the -f seems ott - most compressors just overwrite. J
    Dangeroussssss!
    It is actually needed in some situations, mostly in testing. But it is absolutely out of place here, because you are creating an archive which doesn't exist at all, and csarc insists on overwriting something. A mistake, I think.
    Also, a small typo: "experimential archiver"

  35. #24
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Thank you very much! I may try it some time in the future. However, 8s may be a little long for the m1/m2 modes, where the total time is about 10s.
    This is really good advice.

    Quote Originally Posted by Mat Chartier View Post
    Very nice, enwik8 to 24,593,703 in 59.5s and decomp in 2.9s (-m5 -d64).

    I'm working on some dynamic dictionary stuff for mcm v0.83 and found that the following approach works pretty well (in case you're interested in improving on your dictionary preprocessing):
    use capital conversion for first char / whole words
    count the frequencies of all words 3 <= len < xx
    filter out all the words which don't occur at least ~5-10 times
    sort in descending order by using (word.length - 1) * word.freq
    pick top 20 words for 1 byte codes (starting at byte 128 )
    pick next top (128-20) * 128 words for 2 byte codes
    after assigning 1 byte codes, sort the words that will have 2 byte codes lexicographically
    assign 2 byte codes (using bytes 148+, 128+) using the lexicographically sorted words
    using this approach, you should be able to get enwik8 to ~23.6m, dictionary construction takes around 8s for me.

  36. #25
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Already put on GitHub.


    Quote Originally Posted by Bulat Ziganshin View Post
    afair zpaq changed the license to BSD since Matt left Dell. also, can you use github for hosting your code?

  37. #26
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    CSArc already treats all files with the same extension as a solid block.
    Putting everything in one block would invalidate the multi-threading in the current design.

    I just posted the x64 version, which allows more memory to be used.
    BTW, did you publish your test data anywhere? I am a little surprised that on some corpora CSArc 3.3 was worse than CSC 3.2.

    Quote Originally Posted by Nania Francesco View Post
    In the WCC benchmark 2015 (with an Intel Core i7 920 2.66GHz, 6GB RAM),
    CSARC 3.3 fails (the program crashes) with the options "a -r -f -m5 -d512m -t4".
    Why not enable solid mode?

  38. #27
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    I used some OS-related code from ZPAQ; it was very similar or identical across versions.

    Quote Originally Posted by nemequ View Post
    What version of ZPAQ did you copy from? If it was pre-7.01, would it be difficult to replace it with code from the current ZPAQ?

  39. #28
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Thank you!
    Interesting result. Isn't DVD movie data already compressed? Then how can it be compressed so much?

    It is normal for CSArc to be a bit weaker than 7-Zip. However, I am surprised that BSC outperformed both csarc and 7-zip by so much.
    Quote Originally Posted by a902cd23 View Post
    Test of DVD-movie from 1TB HDD <-> RamDisk Z:
    ..

  40. #29
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Seems fixed this time.
    Quote Originally Posted by xezz View Post
    If a file path has multi-byte characters, the file can't be opened.

  41. #30
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Quote Originally Posted by Skymmer View Post
    1. The sources are not compilable. I wonder how the author managed to create the binaries, because there is an error in csa_file.h which causes the following message:
    Code:
    cannot pass objects of non-trivially-copyable type
    Though it's fixable with c_str().
    Yeah, this is really a low-level mistake. I don't know why Visual C++ allowed it to compile. nemequ compiled it in a UNIX context, so the buggy piece of code was skipped in his case.

    Quote Originally Posted by Skymmer View Post
    2. Clumsy multithreading implementation. I tried to archive a folder (8592 files, 3.27GB) using csarc.exe a -f -r -m5 -d1m -t8 TEST.cs d:\MingW\IC\*
    At the beginning CSArc really utilizes all of the cores, but the number of utilized cores quickly drops to 1, and this behaviour results in the following stats:
    Code:
    Process Time     :         754.578s
    Clock Time       :         415.765s
    Is it normal scalability for 8 threads? No.
    I wish to make the engine itself multi-threaded, like lzham / RAR / 7-Zip. I am just not experienced enough to do that. Splitting files by extension and handling them with multiple threads is a simpler way.
    If most files are homogeneous, or have the same extension, multi-threading will not improve the speed much.

    Quote Originally Posted by Skymmer View Post
    3. There is no support for big files. I don't know what the limit is, but when I tried to compress a 9.69GB file with -m1 -d1m -t1, CSArc just crashed at the end.
    The interesting thing is that the whole file had been read (IO Read: 10168226 KB) but the console progress bar reported only 17%, which simply means it is buggy too.
    The output file has no header, and you can't extract even a byte from it.
    This is a really, really important bug! It only happens under 32-bit, because I assumed size_t was 8 bytes under 32-bit as well. I have fixed it now.

    Quote Originally Posted by Skymmer View Post
    4. CSArc is NOT LOSSLESS.
    Code:
    C:\ARC\CSC\csarc3.3\32 csarc.exe x -o g:\TEMP\OUT TEST.cs
    CSArc 3.3, experimential archiver by Siyuan Fu
     (https://github.com/fusiyuan2010)
    [==>                                               ]   6% done******** d:\MingW\IC\Compiler/11.1/072/ipp/em64t/lib/ippjmergedem64t.lib extraction/verify failed
    [=====>                                            ]  13% done******** d:\MingW\IC\Compiler/11.1/072/ipp/em64t/lib/ippvmmergedem64t.lib extraction/verify failed
    [==========>                                       ]  23% done******** d:\MingW\IC\Compiler/11.1/072/ipp/ia32/lib/ippsmerged_t.lib extraction/verify failed
    [=============>                                    ]  29% done******** d:\MingW\IC\Compiler/11.1/072/ipp/ia32/lib/ippimerged.lib extraction/verify failed
    [==============>                                   ]  30% done******** d:\MingW\IC\Compiler/11.1/072/ipp/ia32/lib/ippmmerged_t.lib extraction/verify failed
    [================>                                 ]  34% done******** d:\MingW\IC\Compiler/11.1/072/ipp/em64t/lib/ippgenmergedem64t.lib extraction/verify failed
    [==================================================]  100% done
    
    Process ID       : 3140
    Thread ID        : 3692
    Process Exit Code: 0
    Thread Exit Code : 0
    Should I comment it?
    Also please note the exit code.
    Sorry again to hear that. Could you please show me how to reproduce this? On what kind of data?
    As for the exit code, I have fixed it.

    Quote Originally Posted by Skymmer View Post
    An impressive joke, I must say. Self-irony is a good skill. Anyway, I'm sorry to tell you that you didn't make a single step forward.
    Also, about x64-win: it's easy to create, but I think there is no point in publishing builds of junk software.
    I understand the anger when people find something that has lots of malfunctions. It happens to me too. I will try my best to improve it. The only thing is that I have very limited time to work on this.

  42. The Following 2 Users Say Thank You to Fu Siyuan For This Useful Post:

    avitar (27th March 2015),Matt Mahoney (27th March 2015)


