Thread: BCM v0.08 - The ultimate BWT-based file compressor!
  1. #1
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts

    Cool BCM v0.08 - The ultimate BWT-based file compressor!

OK, a new release is here. It was hardcore work. Anyway, check out what's new:
    • This is a complete rewrite of the compressor
    • Default block size now is 64 megabytes
    • Introduced a new concept in BWT-output encoding. Still no cheating or tricks like M03
    • Many optimizations and code cleanups
    So, it was really hard to beat BCM v0.07, but I did it! The brand new BCM is one of the strongest BWT-based compression engines available. A few weeks of running optimizations, some serious analysis of special contexts and models for BWT data... and a countless number of bottles of Vodka and Pepsi...

    Some brief results:

    book1 -> 209,826 bytes
    calgary.tar -> 779,619 bytes
    world95.txt -> 466,250 bytes
    fp.log -> 557,226 bytes
    ENWIK8 -> 20,744,613 bytes
    bible.txt -> 725,091 bytes
    3200.txt -> 3,660,195 bytes

    So anyways, enjoy!

    Attached Files

  2. #2
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts

    Thumbs up

    Thanks Ilia!

    Mirror: Download

  3. #3
    Member Vacon's Avatar
    Join Date
    May 2008
    Location
    Germany
    Posts
    523
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Hello everyone,

    Thanks, Ilia!

    Best regards!

  4. #4
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
    20 744 613 on enwik8 using 1024m

  5. #5
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts
    1024m? No way! It uses 5N!

  6. #6
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by encode View Post
    1024m? No way! It uses 5N!
    Lol, sorry for the misunderstanding. I meant using e1024. I have now read the memory usage and, yes, it used one 95m block, hence around 475 MB of memory.

  7. #7
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts

    Thumbs up

    Quick speed test on my old 750 MHz, 512MB RAM Pentium 3 machine...

    D:\viper>bcm e25 enwik8 enwik8.bcm
    VIPER v1.0 beta
    Copyright (c) 2009 by LovePimple

    bcm v0.08 by ilia muraviev
    encoding 25m block...
    encoding 25m block...
    encoding 25m block...
    encoding 20m block...
    ratio: 1.776 bpb
    done

    Kernel Time = 2.323 = 00:00:02.323 = 0%
    User Time = 309.785 = 00:05:09.785 = 97%
    Process Time = 312.108 = 00:05:12.108 = 97%
    Global Time = 319.028 = 00:05:19.028 = 100%
    The compressed size of enwik8 is 21.1 MB (22,203,843 bytes) with 25MB block size.


    Same test, but with 48MB block size...

    D:\viper>bcm e48 enwik8 enwik8.bcm
    VIPER v1.0 beta
    Copyright (c) 2009 by LovePimple

    bcm v0.08 by ilia muraviev
    encoding 48m block...
    encoding 47m block...
    ratio: 1.717 bpb
    done

    Kernel Time = 2.733 = 00:00:02.733 = 0%
    User Time = 321.191 = 00:05:21.191 = 97%
    Process Time = 323.925 = 00:05:23.925 = 98%
    Global Time = 330.094 = 00:05:30.094 = 100%
    The compressed size of enwik8 is 20.4 MB (21,466,782 bytes) with 48MB block size.


    Same again, but with 95MB block size...

    D:\viper>bcm e96 enwik8 enwik8.bcm
    VIPER v1.0 beta
    Copyright (c) 2009 by LovePimple

    bcm v0.08 by ilia muraviev
    encoding 95m block...
    ratio: 1.660 bpb
    done

    Kernel Time = 6.669 = 00:00:06.669 = 1%
    User Time = 331.206 = 00:05:31.206 = 57%
    Process Time = 337.875 = 00:05:37.875 = 58%
    Global Time = 581.045 = 00:09:41.045 = 100%
    The compressed size of enwik8 is now down to 19.7 MB (20,744,613 bytes) with 95MB block size.

    Note: Slower global time is mainly due to disk thrashing caused by excessive paging (only 512MB RAM).

  8. #8
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    688
    Thanks
    41
    Thanked 174 Times in 88 Posts
    No doubt it's good news! Thank you Ilia!!!
    Gonna test it today.

  9. #9
    Member
    Join Date
    Jun 2008
    Location
    Lévis, Canada
    Posts
    30
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't think a BWT compressor can be practical:

    a) BWT is not good for executables.
    b) BWT compressors use blocks. With a stream compressor, for example, a web server could compress the first half of an HTML file at the same time the PHP interpreter is writing the second half.
    c) No option to keep broken files.

  10. #10
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts

    Thumbs up

    Quote Originally Posted by encode View Post
    OK, a new release is here. It was hardcore work. Anyway, check out what's new:
    • This is a complete rewrite of the compressor
    • Default block size now is 64 megabytes
    • Introduced a new concept in BWT-output encoding. Still no cheating or tricks like M03

    Just curious: what happens if you use BWTS instead of BWT? Would it compress smaller?

    I was also wondering what the cheating and M03 stuff you're talking about is.

    Also, you did a great job. Way to go.

  11. #11
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts

    Smile

    Quote Originally Posted by nanoflooder View Post
    Lol, sorry for the misunderstanding. I meant using e1024. I have now read the memory usage and, yes, it used one 95m block, hence around 475 MB of memory.
    New BCM allows any number as a memory parameter, but:
    • BCM will never use a block size larger than the file
    • If the number is too large or zero, BCM falls back to the default block size
    • Due to Windows limitations, the max block size is 512 MB, no matter how much RAM you have. The max allocation block is 2 GB, even under 64-bit Windows. BCM allocates two chunks of memory: one for the pointers, which need 4N of memory, and one for the block itself. So, 512 MB * 4 = 2 GB! (See the sketch after this list.)
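    A minimal sketch of that arithmetic (illustrative only; the block sizes come from this thread, not from BCM's source):

    Code:
    #include <cstdio>
    #include <cstdint>

    // 5N memory layout as described above: N bytes for the block itself
    // plus 4N bytes for the 32-bit suffix pointers used by the BWT sort.
    int main() {
        const uint64_t MAX_ALLOC = 2ULL << 30;     // 2 GB single-allocation limit
        for (uint64_t mb : {25ULL, 95ULL, 512ULL}) {
            uint64_t n        = mb << 20;          // block size in bytes
            uint64_t pointers = 4 * n;             // 4N pointer chunk
            uint64_t total    = pointers + n;      // 5N overall
            printf("block %3llu MB -> total %4llu MB (pointer chunk %s the 2 GB limit)\n",
                   (unsigned long long)mb, (unsigned long long)(total >> 20),
                   pointers <= MAX_ALLOC ? "within" : "over");
        }
        return 0;
    }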
    Quote Originally Posted by Mihai Cartoaje View Post
    I don't think a BWT compressor can be practical:

    a) BWT is not good for executables.
    b) BWT compressors use blocks. With a stream compressor, for example, a web server could compress the first half of a HTML file at the same time the PHP interpreter is writing the second half.
    c) No option to keep broken files.
    a) That's relative, and it depends on the actual executable. In some cases it may beat even LZMA at max settings. Anyway, yep, the overall compression is not as good as with texts and similar files.

    b) Not a problem. Use a smaller block size. Network traffic is organized into packets; streamed audio and media are organized into frames. So the block size is simply equal to a frame or a packet.

    c) Compressed-data recovery is a fable. High compression is incompatible with any recovery: change one byte and you lose everything. Such a feature is added to archivers mostly to keep users calm. Malcolm Taylor (the author of WinRK) posted a nice message about data recovery some time ago. In practice, only data with small damage can be fixed, at the cost of (attention!) 10...25% or more extra data. Worth it? It's about the same as sending or storing two copies of the compressed data and, in case of damage, using the undamaged copy. Anyway, data storage and networks are too reliable these days to get broken files...

    Quote Originally Posted by biject.bwts View Post
    just curious what happens if you use BWTS instead of BWT would
    it compress smaller.

    I also was wondering what is the cheating and MO3 stuff your
    talking about.

    Also you did a great job way to go.
    With no BWT index coded we might win one byte! Better to think about better modeling than about such eccentric ideas...

    A small preamble: the new BCM achieves about the same compression power as PAQ8 on BWT data. BCM is even notably stronger on BWT data than the compression monsters of past times like PAQ6. That means BCM is close to the maximum BWT compression possible!

    Anyway, we have the M03 idea. It's not pure BWT by itself. AFAIK, it is much like a CM that uses BWT structures to access the statistics. Most likely BWMonstr uses the same approach. Anyway, M03 and BWMonstr are way too slow; better to use pure CM in such a case, and then we have both faster and stronger compression. In my opinion, the strength of BWT is in its speed and efficiency. Making it too slow kills any of its advantages over CM or PPM.

    Anyway, BCM is already more CM than BWT. The BWT in BCM is intended to help access higher-order statistics. One of the problems of CM/PPM is efficient access to high-order statistics: the model becomes larger, serious cache misses can't be avoided, more complex blending of all the available models is needed, and if we go bitwise, all our performance drops away. With BCM I use BWT plus a small CM; the entire CM model fits in less than 300 KB! So it is extremely efficient in terms of caching. Yep, the model is really complex and computationally expensive, but I've done as many optimizations as possible. You would never believe how much stuff works inside BCM, how many sub-models, and how fast and memory-efficient it all is! Hardcore work, and such a result! Anyway, I will do some experiments with the M03 idea as well...
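    For readers unfamiliar with this style of model, here is a toy sketch of the kind of bit counter a small, cache-friendly CM is typically built from (purely illustrative; the table size, contexts, and adaptation rate are assumptions, not BCM's actual model):

    Code:
    #include <cstdint>

    // A 12-bit probability counter per context, nudged toward each observed bit.
    struct Counter {
        uint16_t p = 2048;                     // P(bit == 1) scaled to 0..4095
        void update(int bit) {
            const int RATE = 5;                // higher = slower adaptation
            if (bit) p += (4095 - p) >> RATE;
            else     p -=  p          >> RATE;
        }
    };

    // A toy order-1 bitwise model: one counter per (previous byte, partial
    // byte) context. The whole table is 256*256*2 bytes = 128 KB, so it
    // stays cache-friendly. A real coder feeds p into an arithmetic coder.
    struct Order1Model {
        Counter t[256][256];
        int ctx = 1;                           // partial byte with a leading 1 bit
        uint8_t prev = 0;
        uint16_t predict() const { return t[prev][ctx].p; }
        void update(int bit) {
            t[prev][ctx].update(bit);
            ctx = (ctx << 1) | bit;
            if (ctx >= 256) { prev = uint8_t(ctx); ctx = 1; }
        }
    };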

    Thanks for reading!


  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > So, 512 MB*4 = 2 GB!

    It might be worthwhile to do some research on chunked storage
    of BWT input - like allocating 1M blocks and remapping
    the indexes into actual offsets via a table.
    That way it'd be possible to use block sizes up to ~610M
    with 5N BWT (not that blocks larger than filesize/2 are any
    use, but just for the record).
    But the actual benefit would be fewer restrictions when used
    as a part of some GUI framework (these commonly have
    weird memory fragmentation).
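    A sketch of that remapping (the 1 MB chunk size and 32-bit logical indexes are assumptions for illustration, not actual rcbwt or BCM code):

    Code:
    #include <cstdint>
    #include <vector>

    // The BWT input lives in 1 MB chunks; a table remaps logical indexes
    // to (chunk, offset). No single allocation is larger than 1 MB, so a
    // fragmented heap can still serve a multi-hundred-MB block.
    class ChunkedBuffer {
        static const uint32_t CHUNK_BITS = 20;              // 1 MB chunks
        static const uint32_t CHUNK_MASK = (1u << CHUNK_BITS) - 1;
        std::vector<uint8_t*> chunks;                       // the remap table
    public:
        explicit ChunkedBuffer(uint64_t size) {
            uint64_t n = (size + CHUNK_MASK) >> CHUNK_BITS;
            for (uint64_t i = 0; i < n; i++)
                chunks.push_back(new uint8_t[1u << CHUNK_BITS]);
        }
        ~ChunkedBuffer() { for (uint8_t* c : chunks) delete[] c; }
        uint8_t& operator[](uint32_t i) {                   // flat logical view
            return chunks[i >> CHUNK_BITS][i & CHUNK_MASK];
        }
    };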

    Then another possibility is a multi-pass BWT (like what
    I did in rcbwt) - filtering out an interval of indexes
    on each pass, and sorting them.
    It's kinda similar to the usual blockwise BWT (in speed too),
    but instead of a 500M data buffer + 2G index buffer for enwik9,
    we'd use a 1G data buffer + 2G index (2861M, should fit with /3GB).

    > That's relative, and it depends on the actual executable

    More specifically, it depends on the preprocessing method.
    With a disasm32-style preprocessor the compression would
    be pretty good.

    > Anyway, data storage and networks are too reliable
    > these days to get broken files

    1. That's wrong. Just try to download some large volume of data
    (like rainbow tables) via HTTP or FTP.
    Anyway, it's quite probable to get an error within 100G of data,
    as the IP packet checksum is only 16 bits.

    2. Rar's recovery saved me a few times when uploading video to
    a server - you only need filesize/sectorsize*crcsize + sectorsize
    extra data to fix any single broken sector.
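    Worked example with illustrative figures (my own numbers, not from the thread): for a 700 MiB file with 512-byte sectors and 4-byte CRCs, that formula gives 734003200 / 512 * 4 + 512 = 5734912 bytes of extra data, about 5.5 MiB, or roughly 0.8% of the file, to repair any single bad sector.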

    Though it's certainly in no way related to BWT - error correction
    methods are applicable to any kind of data.

  13. #13
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by encode View Post
    With no BWT index coded we might win one byte! Better to think about better modeling than about such eccentric ideas...
    Well... BWTS + [start index] is not BWT. BWTS has a slightly different permutation and thus usually makes files a few KiB smaller - not only the 4 bytes of the start index.
    BIT Archiver homepage: www.osmanturan.com

  14. #14
    Moderator

    Join Date
    May 2008
    Location
    Tristan da Cunha
    Posts
    2,034
    Thanks
    0
    Thanked 4 Times in 4 Posts

    Thumbs up

    MC SFC test...

    A10.jpg > 824,677
    AcroRd32.exe > 1,562,601
    english.dic > 1,164,850
    FlashMX.pdf > 3,735,285
    FP.LOG > 557,226
    MSO97.DLL > 1,889,122
    ohs.doc > 823,380
    rafale.bmp > 753,524
    vcfiu.hlp > 640,574
    world95.txt > 466,250

    Total = 12,417,489 bytes

  15. #15
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts
    Don't forget that on some files a smaller block is better...

  16. #16
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by encode View Post
    Anyway, we have the M03 idea. It's not pure BWT by itself. AFAIK, it is much like a CM that uses BWT structures to access the statistics. Most likely BWMonstr uses the same approach. Anyway, M03 and BWMonstr are way too slow...
    M03 isn't slow at all. It's mij4x's implementation which is slow. I have it running about as fast as M99 right now, and in 5N space as well. That is with unbounded-order context modeling. Limited order is really not much faster, as most of the time is spent at the lower orders anyhow.

    If I could only find enough time to put the encoder from M99 into the modeling of M03, I would be able to release it! But work on the house never ends...

    Anyhow, M03 deals strictly with modeling the context boundaries contained within the BWT. There's still lots of room for improvements in how to encode the statistics generated from that modeling.

    I'll see if I can make time in the next week or two to get the fast M03 posted.

    - Michael Maniscalco

  17. #17
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    688
    Thanks
    41
    Thanked 174 Times in 88 Posts
    A couple of tests between v0.08 and v0.07

    Code:
                  | encode time |     size    | decode time |
    --------------------------------------------------------
                  |         PAK0.PAK (302 895 669)          |
    --------------------------------------------------------
    0.07 -b65536  |  181.813    | 117 053 060 |  135.813    |
    0.08 e64      |  227.422    | 116 587 092 |  185.234    |
    --------------------------------------------------------
                  |          ENWIK8 (you know :) )          |
    --------------------------------------------------------
    0.07 -b102400 |   61.265    |  20 770 673 |   45.109    |
    0.08 e100     |   77.828    |  20 744 613 |   61.172    |
    --------------------------------------------------------
                  |         NKOEF.DBF (151 585 377)         |
    --------------------------------------------------------
    0.07 -b32768  |   71.735    |   2 942 131 |   59.453    |
    0.08 e32      |  102.531    |   2 912 917 |   87.000    |
    --------------------------------------------------------

  18. #18
    Member
    Join Date
    Aug 2008
    Location
    Saint Petersburg, Russia
    Posts
    215
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Having seen that ^ I would say it makes sense to keep both versions ready to run. The time difference is quite notable here...

  19. #19
    Member Skymmer's Avatar
    Join Date
    Mar 2009
    Location
    Russia
    Posts
    688
    Thanks
    41
    Thanked 174 Times in 88 Posts
    Quote Originally Posted by encode View Post
    c) Compressed-data recovery is a fable. High compression is incompatible with any recovery: change one byte and you lose everything. Such a feature is added to archivers mostly to keep users calm. Malcolm Taylor (the author of WinRK) posted a nice message about data recovery some time ago. In practice, only data with small damage can be fixed, at the cost of (attention!) 10...25% or more extra data. Worth it?
    From my point of view you're a little bit pessimistic in giving data recovery such a rating. I just made a little test. I took one album in MP3 format (320 kbit, 69:56 length, 167,712,768 bytes) and packed it into a RAR archive (Best, Solid, 1% recovery data). The resulting size of the RAR archive is 167,998,135 bytes, of which the recovery data takes only 998,912 bytes. Then I zero-filled a 204,800-byte block somewhere in the middle with WinHEX. The 204,800-byte block emulates 100 consecutive unread sectors of a CD\DVD medium. RAR successfully repaired the archive. I think 204,800 bytes is not such a small amount of data, and 998,912 bytes of recovery data is not too much. Well, 1% only. So recovery data can be very helpful and not too burdensome at the same time. Just my point of view.

  20. #20
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by encode View Post
    • Introduced a new concept in BWT-output encoding. Still no cheating or tricks like M03
    I just noticed this line. How is it a "trick" or "cheating" to get full order context modeling of the BWT for encoding?

    M03 doesn't involve any preprocessing, alphabet reordering, RLE or secondary transforms or any filtering of any kind. It's basically optimal parsing of the BWT string along all context boundaries for each context order and nothing more. Hardly "cheating" ...

  21. #21
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts
    At first sight, M03 looks like a CM that uses BWT structures to access the statistics...

  22. #22
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts
    The compressor is featured in the BWT comparison.

  23. #23
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,013
    Thanks
    406
    Thanked 403 Times in 153 Posts
    Currently I'm working on BCM v0.09. And again I have many proofs that a proper CM is superior to anything else. For example, an order-0 CM works better than MTF+order-0 CM. Yep, MTF may help with dummy coders, but with even a simple CM it only hurts. The MTF stage makes the data more stationary, but it removes some correlations and adds some noise, so compression with CM degrades very seriously. So, I think the future of BWT-output coding is more advanced CM coders. Fortunately, we have unlimited ideas and options here. For example, I invented a very special model derived from the ICM idea. Such a standalone model achieves 213,886 bytes on book1!!! That is close to the BBB result, while being super fast, because the complexity of this model is equal to two[!] counters! I'm still deep in experiments and very creative ideas about BWT-output encoding... Additionally, I'm playing with different contexts for CM and various ideas such as QLFC...
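    Purely for illustration (BCM's actual model is not public; the adaptation rates and the averaging below are assumptions), a bit predictor whose complexity is "two counters" might look like this:

    Code:
    #include <cstdint>

    // Two probability counters with different adaptation speeds track the
    // same context; the prediction is their average. The fast one reacts
    // to local changes, the slow one supplies stable long-term statistics.
    struct TwoSpeedCounter {
        uint16_t fast = 2048, slow = 2048;     // P(bit == 1), scaled 0..4095
        uint16_t predict() const { return (fast + slow) >> 1; }
        void update(int bit) {
            if (bit) { fast += (4095 - fast) >> 3; slow += (4095 - slow) >> 7; }
            else     { fast -=  fast         >> 3; slow -=  slow         >> 7; }
        }
    };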

  24. #24
    Member Yuri Grille.'s Avatar
    Join Date
    Mar 2009
    Location
    ****
    Posts
    35
    Thanks
    0
    Thanked 1 Time in 1 Post

    Hi encode.su, you can download a very small BCM 0.8.

    This is an optimized bcm at only 48.7 KB:

    [removed]

    Enjoy this good compressor
