View Poll Results: Should I release BCM v2.00 (not compatible with v1.xx)?

Voters: 29
  • Yes: 28 (96.55%)
  • No: 1 (3.45%)

Thread: BCM - The ultimate BWT-based file compressor

  1. #61 encode:
    The open-source BCMv1 is about one year old now. Previously, all versions of BCM were incompatible with each other: there is the slow-but-strong BCM v0.09, the pretty balanced BCM v0.14, and I also have an unreleased FSM-based post-coder (BWTANK), which is pretty cool.
    Anyway, I think the best candidate for BCMv2 is an optimized BCM v0.14!
    As to CRC: unlike simple formats such as LZ4, BCM can detect stream errors without any checksum, thanks to its slightly redundant block size/BWT index coding.
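    As an illustration of the idea (a hypothetical sketch, not BCM's actual header layout), redundancy between header fields already lets a decoder reject an obviously corrupted stream:
    Code:
    // Hypothetical sketch, not from the BCM sources: slightly redundant header
    // fields allow sanity checks even without a checksum.
    #include <cstdint>

    bool header_looks_sane(uint32_t block_size, uint32_t bwt_index, uint32_t max_block_size)
    {
      return block_size > 0
          && block_size <= max_block_size   // declared size must respect the format limit
          && bwt_index <= block_size;       // BWT primary index must point inside the block
                                            // (exact bound depends on the index convention)
    }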

  2. #62 Member (Urbana, IL):
    Replacing putc with putc_unlocked (GCC) / _putc_nolock (MSVC) and getc with getc_unlocked / _getc_nolock makes compression and decompression 6% faster.

  3. Thanks: encode (18th April 2016)

  4. #63 encode:
    Code:
    #define _CRT_DISABLE_PERFCRIT_LOCKS
    does it automatically with MSVC...

    As to GCC, well, I guess we need a solution that hits both compilers - either undef/define getc/putc, or simply use _putc_nolock and define it for GCC:
    Code:
    #define _putc_nolock putc_unlocked
    #define _getc_nolock getc_unlocked
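    For reference, a minimal sketch of the combined approach - assuming the codec calls _putc_nolock/_getc_nolock directly, and noting that putc_unlocked/getc_unlocked are POSIX functions, so MinGW may need a fallback:
    Code:
    // Sketch only: MSVC already provides _putc_nolock/_getc_nolock natively;
    // on GCC/Clang (POSIX) map them to the unlocked stdio variants.
    // Alternatively, define _CRT_DISABLE_PERFCRIT_LOCKS before <stdio.h>
    // to make plain putc/getc non-locking on MSVC.
    #include <stdio.h>

    #ifndef _MSC_VER
      #define _putc_nolock(c, f) putc_unlocked((c), (f))
      #define _getc_nolock(f)    getc_unlocked(f)
    #endif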

  5. #64 encode:
    Well, after quite intense re-optimization and research, I have decided to keep BCM as is.
    I got pretty interesting results with a byte-oriented arithmetic encoder and an MTF-like transform, but that would have to become a new BWT-based file compressor.
    Yes, an RC with carries can be slightly faster at decoding, especially in a 32-bit compile. Yes, I could get higher compression at the same speed - but that is mostly a file-tuning matter.
    And yes, BCM could be stronger compression-wise, but at that level of CM complexity the BWT stage becomes unnecessary - the standalone model gives a higher compression ratio anyway.
    However, the byte-oriented encoder results led me to a renewed PPMX compressor. So I will release a new PPM-based file compressor very soon (PPMX v0.10)!

  6. Thanks: Mike (30th April 2016)

  7. #65 encode:
    BTW, it is possible to add optional CRC checking while keeping full compatibility - a "-c" switch or something like that.
    CRC32 is the most widely used checksum in the data compression world; however, the newer CRC32C, which has a hardware implementation, looks like a strong competitor here.

  8. #66 Bulat Ziganshin:
    FARSH will be faster on most CPUs.

  9. #67 encode:
    CRC32 is fast enough for most applications (especially the slicing-by-4/8 implementations), and a hardware implementation is an additional plus here. In my opinion, we should stay away from Adler32, xxHash, <type your hash here>. Data should be checked by well-known and well-tested algorithms that users trust! Yes, Adler32 is a well-known algorithm, but it has well-known weaknesses...
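    For reference, here is a minimal table-driven CRC32 (the standard IEEE/zlib polynomial, reflected form); slicing-by-4/8 is the same recurrence with larger tables that consume 4 or 8 input bytes per step. A sketch only, not BCM's implementation:
    Code:
    // One-byte-per-step (Sarwate) CRC32, polynomial 0xEDB88320.
    #include <cstdint>
    #include <cstddef>

    uint32_t crc32(const uint8_t* data, size_t n, uint32_t crc = 0)
    {
      static uint32_t table[256];
      static bool ready = false;
      if (!ready)
      {
        for (uint32_t i = 0; i < 256; ++i)
        {
          uint32_t c = i;
          for (int k = 0; k < 8; ++k)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1)));
          table[i] = c;
        }
        ready = true;
      }
      crc = ~crc;                                           // standard init
      for (size_t i = 0; i < n; ++i)
        crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFF];   // one table lookup per byte
      return ~crc;                                          // standard final xor
    }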

  10. #68 encode:
    During PPMX vs BCM testing, I got sick and tired of relaunching BCM with the "-f" switch - so I decided to release BCM v1.04 with an overwrite prompt. In addition, I increased the default block size to 32 MB!

    Honestly, BCM's performance ruined my enthusiasm for improving PPMX (I will release PPMX very soon anyway).

    https://github.com/encode84/bcm

    Also, I'm planning to release a BCM with optional CRC checking as a beta test version soon...

  11. #69 Member (Urbana, IL):
    Quote Originally Posted by encode View Post
    CRC32 is fast enough for most applications (especially the slicing-by-4/8 implementations), and a hardware implementation is an additional plus here. In my opinion, we should stay away from Adler32, xxHash, <type your hash here>. Data should be checked by well-known and well-tested algorithms that users trust! Yes, Adler32 is a well-known algorithm, but it has well-known weaknesses...
    If you haven't heard of it yet: there is also a hardware implementation of CRC32 based on this paper, implemented here and here. In my tests, it is 50% faster than slicing-by-8.
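    For comparison, the other hardware path mentioned earlier in the thread is the SSE4.2 CRC32 instruction, which computes CRC32C (the Castagnoli polynomial), so its values differ from standard CRC32. A sketch only, assuming an x86 CPU with SSE4.2:
    Code:
    // CRC32C via the SSE4.2 crc32 instruction (compile with -msse4.2 on GCC/Clang).
    #include <cstdint>
    #include <cstddef>
    #include <cstring>
    #include <nmmintrin.h>   // _mm_crc32_u64 / _mm_crc32_u8

    uint32_t crc32c_hw(const uint8_t* data, size_t n)
    {
      uint64_t c = 0xFFFFFFFFu;                // standard init
      size_t i = 0;
      for (; i + 8 <= n; i += 8)
      {
        uint64_t w;
        std::memcpy(&w, data + i, 8);          // unaligned-safe 8-byte load
        c = _mm_crc32_u64(c, w);               // 8 input bytes per instruction
      }
      for (; i < n; ++i)                       // tail bytes
        c = _mm_crc32_u8(static_cast<uint32_t>(c), data[i]);
      return static_cast<uint32_t>(c) ^ 0xFFFFFFFFu;   // standard final xor
    }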

  12. Thanks: Bulat Ziganshin (9th May 2016)

  13. #70 encode:
    I knew this...

  14. #71 Member (France):
    It is a common misconception to say that CRC32 must be better because it has been "known for a longer time".
    This misunderstands the original objectives of CRC32.

    If you test CRC32 and compare it with a good-quality hash, you'll notice that CRC32 has worse dispersion qualities than the hash.
    This is not an error, nor a horrible weakness: CRC32 was created for a different purpose, in a time when bit errors were a real issue.

    What CRC32 offers is a guarantee that if 2 streams differ by <= N bits (with N depending on the CRC32 variant), they will necessarily produce different results,
    in contrast with a good 32-bit hash, which offers a flat collision probability of about 1 in 4 billion (2^-32) whatever the difference.

    CRC32 pays for this guarantee on short-range differences with worse collision probabilities for all errors > N bits!
    (by about 3x, last time I checked)

    So, when do we generate short errors nowadays?
    This is no longer the 1980s; we no longer risk electrical transmission errors of a few bits. Error detection and correction algorithms are everywhere.
    Nowadays, either the data gets through fine (most of the time), or it gets completely mangled.

    The same goes for compression algorithms: while in some cases an error of 1 or 2 bytes is theoretically possible,
    in practice, and especially for higher-compression algorithms with an entropy-coding layer, once an error appears, it quickly propagates.
    So it's the whole file/packet after the first error that becomes nonsense - way more than a few bits.

    Consequences for an error-detection algorithm:
    - If your application is prone to generating/receiving small errors (just a few bits), using CRCxx to guarantee their detection is still a sensible choice.
    - If not, be aware that CRCxx has worse collision characteristics than a good hash.

  15. Thanks: Bulat Ziganshin (9th May 2016)

  16. #72 Member (Greece):
    If the hash function is meant to detect errors where many bits have changed, then I think a simple 32-bit XOR is enough.
    A 32-bit XOR can detect any error where the number of changed bits is odd, as well as any error where the changed bits form a contiguous run whose length is not a multiple of 64; and if the corrupted file is completely random (many bits changed), the probability of missing the error is 2^-32.
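    For illustration, a minimal sketch of such a word-wise 32-bit XOR checksum (a hypothetical helper, not related to BCM):
    Code:
    // XOR all 32-bit words of the buffer; a trailing partial word is zero-padded.
    #include <cstdint>
    #include <cstddef>
    #include <cstring>

    uint32_t xor32(const uint8_t* data, size_t n)
    {
      uint32_t sum = 0;
      size_t i = 0;
      for (; i + 4 <= n; i += 4)
      {
        uint32_t w;
        std::memcpy(&w, data + i, 4);
        sum ^= w;
      }
      uint32_t tail = 0;
      for (int shift = 0; i < n; ++i, shift += 8)
        tail |= static_cast<uint32_t>(data[i]) << shift;   // zero-padded last word
      return sum ^ tail;
    }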

  17. #73 encode:
    Since Alexander Rhatushnyak pointed me to an SSE initialization flaw, I have a big itch to fix it.
    So, please have a look at BCM v1.09FIX:
    • It is not compatible with the current BCM
    • It has mandatory CRC32C checking; for DEBUG/verification purposes, BCM will print out the CRC value

    Please test it, and tell me what you think!

  18. #74 encode:
    The 1.09FIX2 version:
    Now I'm no longer compressing the block size/BWT pointers with the CM; I'm storing them as 32-bit integers via the range encoder (as well as the newly added CRC). Please test it with smaller blocks as well!
    Maybe I should add variable bit-length coding for these, now that we have CRC32C here!

  19. #75 encode:
    BCM v1.09RC - a final fix - featuring a slightly re-tuned model. In some cases the compression gain is notable; in others, however, we slightly lose compression... A mixed bag, but I'm leaning towards this RC!

    THE ROADMAP:
    • For a couple of weeks, please test the FIX2 & RC test builds
    • I'll keep one of them and release it on GitHub as a BETA
    • After one month or so, I will change the release status to stable - this will be the final and true BCM v1


  20. Thanks (2): Euph0ria (14th May 2016), Samantha (23rd May 2016)

  21. #76 encode:
    Okay, a new version is here: https://github.com/encode84/bcm

    • Added standard CRC32 checking
    • Slightly retuned the CM model
    • Changed the license to MIT (to match libdivsufsort, and I feel more comfortable with it than with Public Domain)

    Happy testing!

  22. Thanks (3): joerg (26th May 2016), Matt Mahoney (28th May 2016), Mike (25th May 2016)

  23. #77 joerg (Germany):
    Congratulations, encode! Your new bcm works very well!

    c:\COMPR-2016\PGM>bcm ..\dta\backup.dat ..\result\l-bcm
    Compressing ..\dta\backup.dat: 64442763264->7619857328 in 11942.0s
    CRC = DF033EE5

    c:\COMPR-2016\PGM>bcm -d ..\result\l-bcm ..\dta\backup.dat
    Decompressing ..\result\l-bcm: 7619857328->64442763264 in 10775.8s
    CRC = DF033EE5

    ---
    The compression performance is comparable to 7-Zip.

    But if we want to use the program for real backup purposes,
    then it should keep not only the filename but the timestamp too.

    Maybe like bzip2 does it:
    backup.dat and backup.dat.bcm should have identical timestamps,

    and decompression should produce the identical timestamp too.
    ---

    I have two questions:

    1. Does bcm use only one core?
    2. Are compression and decompression symmetrical, i.e. do they need roughly the same time ("plus/minus")?

    best regards

  24. Thanks: encode (26th May 2016)

  25. #78 encode:
    I hear you about the file time/date stamps.

    1. Yes, BCM is single-threaded, since it is designed to compress a file in a single block (or at least that's what I have in mind, assuming the user sets the block size to the maximum allowed value). As a note, libdivsufsort has OpenMP support to accelerate the BWT stage.
    2. Compression is just slightly slower - thanks to the excellent libdivsufsort library! In any case, it's not an LZ77-based compressor, so think of it as "compression time" ~= "decompression time".

    As to compression performance vs 7-Zip: BCM is a BWT-based file compressor - on some files (text files/pictures/already compressed data/...) LZMA cannot compete with BCM. Yes, 7-Zip has PPMd for such files, but BCM is stronger than PPMd in many cases.


  26. #79 joerg (Germany):
    @encode: thank you for your quick answer.

    For me it is good that bcm uses only one core - so I can start it on a server with several cores and the other cores keep working as before.

    Thinking about "keeping the timestamp" and an "archive format" -

    another way would be for someone to write a plugin so it can be used within 7-Zip as a custom codec plugin,

    like rgeldreich has done for LZHAM: http://encode.su/threads/15-7-Zip?p=...ll=1#post45716

    best regards

  27. #80 encode:
    As to a 7-Zip plug-in and such - after some period of time I will do it myself!

    I need some time to test and finalize the BCMv1 format. Haste makes waste! Previously, I was in a hurry to release BCMv1 on SourceForge - getting ready for my marriage and moving to another apartment - and I was afraid of abandoning BCM and all my compression work. So, BCM v1.10+ is sort of the finalized BCMv1 format!

  28. #81 encode:
    BCM v1.20 beta is here: http://github.com/encode84/bcm

    Now BCM preserves the original access/modification time of a file.

  29. Thanks (2): joerg (9th June 2016), Mike (8th June 2016)

  30. #82 joerg (Germany):
    @encode: can you please attach the Windows binary for download within the forum?

    best regards

  31. #83 encode:
    Here you go:
    Attached Files

  32. Thanks: joerg (13th June 2016)

  33. #84 joerg (Germany):
    The program works fine, but produces an error message (I don't know why):

    c:\COMPR-2016\PGM>bcm ..\dta\backup.dat ..\result\lab-bcm
    Compressing ..\dta\backup.dat: 65718343680 -> 7433016230 in 13346.39s
    Stat failed: No such file or directory

  34. #85 encode:
    Quote Originally Posted by joerg View Post
    The program works fine, but produces an error message (I don't know why):

    c:\COMPR-2016\PGM>bcm ..\dta\backup.dat ..\result\lab-bcm
    Compressing ..\dta\backup.dat: 65718343680 -> 7433016230 in 13346.39s
    Stat failed: No such file or directory
    BCM can't read the input file's timestamp.
    Try passing filenames without "..":
    Code:
    > cd ..
    > bcm dta\backup.dat c:\compr-2016\result\lab-bcm

  35. #86 joerg (Germany):
    Thank you, encode, for the hint!

    But (with no ".."):
    ---
    c:\COMPR-2016\DTA>bcm backup.dat lab-bcm
    Compressing backup.dat: 65718343680 -> 7433016230 in 13004.90s
    Stat failed: No such file or directory
    ---
    it is the same.

    Would it be possible to use the program with such a big ("long") file without the error message?

    Here I have a big backup.dat of > 60 GBytes.

    If I use a small backup.dat, then it works correctly without the error message,

    and the resulting file has the same timestamp as the source file:
    ---
    c:\COMPR-2016\DTA>bcm backup.dat lab-bcm
    Compressing backup.dat: 5151328 -> 5173242 in 1.16s
    CRC = DB5726B0
    ---

    best regards

  36. #87 encode:
    Got it!
    And here is a new update - BCM v1.21 beta: https://github.com/encode84/bcm

    As an experiment, I added a progress indicator for compression (the progress updates once per block).

    Please tell me what you think!

  37. #88 joerg (Germany):
    Quote Originally Posted by encode View Post
    update - BCM v1.21 beta: https://github.com/encode84/bcm

    .. added a progress indicator for compression (progress updates per each block).
    Now the resulting file has the same timestamp as the source file:
    ---
    c:\COMPR-2016\DTA>bcm backup.dat lab-bcm
    Compressing backup.dat:
    65718343680 -> 7785867513 in 13206.18s
    CRC = CDE9BC47
    ---
    but???
    previously: "65718343680 -> 7433016230"
    and now: "65718343680 -> 7785867513"
    ---
    I must do further tests...

    Maybe it is better to do only one modification per step...

    best regards

  38. #89 encode:
    Quote Originally Posted by joerg View Post
    but???
    previously: "65718343680 -> 7433016230"
    and now: "65718343680 -> 7785867513"
    I changed the default block size from 64 MB to 32 MB (160 MB of RAM required).
    Please note that you can change the block size with the "-b" option:
    Code:
    bcm -b320 backup.dat
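    As a rough rule of thumb (an estimate inferred from the 32 MB -> 160 MB figure above, not an exact number from the sources), memory use is roughly 5x the block size: the n-byte block plus a 4-byte suffix-array entry per position for the BWT.
    Code:
    // Rough sketch: estimated peak RAM for a given block size,
    // assuming n bytes of block data + 4n bytes of 32-bit suffix array.
    #include <cstddef>

    size_t estimated_ram_bytes(size_t block_size_bytes)
    {
      return 5 * block_size_bytes;   // e.g. 32 MB block -> ~160 MB
    }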

  39. #90 joerg (Germany):
    Quote Originally Posted by encode View Post
    I changed the default block size from 64 MB to 32 MB (160 MB of RAM required).
    Thank you very much, encode, for your quick answer.
    Now it works as expected:
    ---
    c:\COMPR-2016\DTA>bcm -b64 backup.dat lab-bcm
    Compressing backup.dat:
    65718343680 -> 7433016230 in 12778.48s
    CRC = CDE9BC47
    ---
    and the resulting file has the same timestamp as the source file.
    ---
    I want to ask:

    -b32 means: the program needs 160 MB of RAM;
    -b64 means: the program needs 320 MB of RAM?
    -b128 means: the program needs 640 MB of RAM?
    -b256 means: the program needs 1280 MB of RAM?

    best regards


