
Thread: BCM 0.12 is here!

  1. #1 - encode (The Founder, Moscow, Russia)

    BCM 0.12 is here!

    Okay, the new version has been released! BCM 0.12 features an all-new, highly optimized CM post-coder.

    The new BCM is the fastest BCM ever!

    Enjoy!

  2. #2 - Gribok (Programmer, USA)
    Code:
     file |      size |  bcm size |   bcm time |  bsc size |   bsc time
      mozilla |  51220480 |  16001429 | 15.604 sec |  15948277 | 11.435 sec
      webster |  41458703 |   6423933 | 15.129 sec |   6432894 |  8.564 sec
          nci |  33553445 |   1235163 | 10.010 sec |   1204389 |  4.243 sec
        samba |  21606400 |   4032802 |  5.720 sec |   3925261 |  3.448 sec
      dickens |  10192446 |   2243855 |  3.265 sec |   2254916 |  2.153 sec
         osdb |  10085684 |   2275870 |  3.181 sec |   2203400 |  2.137 sec
           mr |   9970564 |   2121652 |  2.894 sec |   2195307 |  1.918 sec
        x-ray |   8474240 |   3669155 |  2.413 sec |   3747896 |  2.308 sec
          sao |   7251944 |   4685419 |  2.295 sec |   4669768 |  2.449 sec
      reymont |   6627202 |    984574 |  1.794 sec |    978277 |  1.107 sec
      ooffice |   6152192 |   2547352 |  1.559 sec |   2555019 |  1.451 sec
          xml |   5345280 |    395319 |  1.131 sec |    379577 |  0.531 sec
    total | 211938580 |  46616523 | 64.995 sec |  46494981 | 41.744 sec
    * bsc 2.4.0 was tested with preprocessing and multi-core support disabled.
    Enjoy coding, enjoy life!

  3. #3 - encode (The Founder, Moscow, Russia)
    Well, BSC uses QLFC and makes fewer than one range coder call per input byte. BCM uses CM and makes at least 8 range coder calls per input byte! Even taking that into account, BCM is pretty close to BSC in terms of processing speed. Anyway, both have strong and weak sides: QLFC adds some redundancy, while CM is computationally expensive but adds none. In some cases this makes a notable difference, even when comparing QLFC to the super-light CM that BCM uses:

    A few examples:

    book1 (768,771 bytes)
    bsc 2.4.0 -> 213,118 bytes
    bcm 0.12 -> 210,794 bytes

    butterfly4.bmp (18,048,054 bytes)
    bsc 2.4.0 -> 5,028,243 bytes
    bcm 0.12 -> 4,766,017 bytes

    rafale.bmp (4,149,414 bytes)
    bsc 2.4.0 -> 777,127 bytes
    bcm 0.12 -> 755,703 bytes

    hellsky2.tga (7,503,794 bytes)
    bsc 2.4.0 -> 655,210 bytes
    bcm 0.12 -> 635,812 bytes

    mptrack.exe (1,159,172 bytes)
    bsc 2.4.0 -> 517,432 bytes
    bcm 0.12 -> 511,333 bytes

    etc.

    * bsc 2.4.0 was tested with preprocessing enabled.

  4. #4 - Gribok (Programmer, USA)
    Quote Originally Posted by encode
    Well, BSC uses QLFC and makes fewer than one range coder call per input byte. BCM uses CM and makes at least 8 range coder calls per input byte!
    Well, you can use RLE-EXP to achieve a smaller number of range coder calls with CM as well.
    Enjoy coding, enjoy life!

  5. #5 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    Compared to bcm 0.11, 0.12 compresses 20% faster and decompresses 25% faster, with 0.2% worse compression. An acceptable tradeoff, I think.

    http://mattmahoney.net/dc/text.html#1686

  6. #6 - encode (The Founder, Moscow, Russia)

    Cool

    Thanks for testing.

    As a note, the compression of BCM 0.12 is not always worse than the previous version's. As an example, take a look at "fp.log":

    BCM 0.11 -> 559,658 bytes in 10 sec
    BCM 0.12 -> 558,328 bytes in 6 sec

    i.e. the speed improvement also depends on data type.

  7. #7 - encode (The Founder, Moscow, Russia)
    Just started working on BCM 0.13. I tested the optimizer and realized what kind of processing power I have now - with my new PC I can do much more: more complex optimizations, optimization on large files, etc. Cool! Anyway, the new BCM (or a new BWT-based encoder) must be much faster than the current BCM. While working on CRUSH I collected a few experiments and ideas with bit-codes and such...

  8. #8 - encode (The Founder, Moscow, Russia)

    Cool

    Compiled a 64-bit version of the BCM v0.12:

    c:\bcm\Release>bcm -b1000 enwik9 enwik9.z
    BCM 0.12 Experimental BWT compressor
    Copyright (C) 2010 Ilia Muraviev

    Compressing...
    1000000000 -> 164654285 in 281 sec

    c:\bcm\Release>bcm -d enwik9.z e9
    BCM 0.12 Experimental BWT compressor
    Copyright (C) 2010 Ilia Muraviev

    Decompressing...
    164654285 -> 1000000000 in 214 sec

    Nice! I guess all new main versions of BCM will be 64-bit at last; 32-bit compiles will be kept for compatibility.

    Did a few experiments. Well, I guess the new BCM must have higher compression. We already have a speedy BSC; BCM's niche for now is heavyweight champion...

    I will probably also add multi-threading.

  9. #9 - Piotr Tarsa (Member, Kraków, Poland)
    encode:
    I strongly suggest you try a static Huffman encoding stage (preferably order-1) on the BWT'd data before the actual CM and arithmetic coding, like Francesco did in his excellent (well, not counting the numerous bugs AFAIR) Rings compressor. That stage doesn't destroy contexts the way RLE does, and it also frees you from separate modelling of the RLE part. The advantage (with pre-Huffman or RLE-whatever) is that you have fewer bits to encode. I think pre-Huffman should result in a comparable number of bits to RLE-whatever, greatly speeding up encoding on well-compressible data.
    Last edited by Piotr Tarsa; 1st June 2011 at 22:12.

  10. #10 - Matt Mahoney (Expert, Melbourne, Florida, USA)
    I added your results. I assume enwik8 is the same, that it uses 5 GB of memory, and that the executable size is the same (probably not). Anyway, it moves up a few spots:
    http://mattmahoney.net/dc/text.html#1647

  11. #11 - Gribok (Programmer, USA)
    About 3.55 MB/s for compression and 4.67 MB/s for decompression, single-threaded. That sounds good. I am curious how bsc64 will perform on an Intel Core i7-2600. Could you please test it with the "-b1000p" and "-b1000pT" options?
    Enjoy coding, enjoy life!

  12. #12 - encode (The Founder, Moscow, Russia)
    As a note, it was just a quick Visual C++ compile to test 1 GB block compression. Most likely, with PGO and/or ICL the processing speed will be higher.

    Strangely, BCM 0.12 is weaker than BBB on ENWIK9. The new BCM MUST beat BBB and BSC with a 1 GB block!

    So, anyway:

    C:\bcm\x64\Release>bsc e enwik9 enwik9.bsc -b1000p
    This is bsc, Block Sorting Compressor. Version 2.6.1. 6 May 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@libbsc.com>.

    enwik9 compressed 1000000000 into 163897226 in 108.889 seconds.

    C:\bcm\x64\Release>bsc d enwik9.bsc e9
    This is bsc, Block Sorting Compressor. Version 2.6.1. 6 May 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@libbsc.com>.

    enwik9.bsc decompressed 163897226 into 1000000000 in 30.701 seconds.

    C:\bcm\x64\Release>bsc e enwik9 enwik9.bsc -b1000pT
    This is bsc, Block Sorting Compressor. Version 2.6.1. 6 May 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@libbsc.com>.

    enwik9 compressed 1000000000 into 163882223 in 163.286 seconds.

    C:\bcm\x64\Release>bsc d enwik9.bsc e9
    This is bsc, Block Sorting Compressor. Version 2.6.1. 6 May 2011.
    Copyright (c) 2009-2011 Ilya Grebnov <Ilya.Grebnov@libbsc.com>.

    enwik9.bsc decompressed 163882223 into 1000000000 in 59.654 seconds.

  13. #13 - Shelwien (Administrator, Kharkov, Ukraine)
    The average run length is different for different volumes of the same text,
    so a direct CM postcoder tuned on 1M or even 10M samples won't be especially efficient for 1G,
    while bsc has an RLE stage, which in theory should be more efficient for large samples.
    Btw, bsc compresses enwik8 to 26,790,117 bytes _before_ entropy coding,
    i.e. its model/rc only has to compress 214,320,941 bits instead of the original 800M,
    so it's hard to compete with it using your method.

  14. #14 - encode (The Founder, Moscow, Russia)
    Note that BCM 0.12 has an extremely simplified CM... Any RLE will add redundancy.

  15. #15 - Shelwien (Administrator, Kharkov, Ukraine)
    Bsc has 3-ary logistic mixing, SSE and state machine contexts, so you won't win with anything simpler.

    Also, for long runs _CM_ adds redundancy, because it's clearly more efficient to encode a million zeroes
    with a 4-byte uncompressed number than with 8 million super-optimized rangecoder calls.
    Note that for the BWT of enwik9 this is a relevant example.

    And RLE doesn't necessarily add redundancy (i.e. it can be made bijective).
    For example, a code like 0001000011101 -> 3,4,0,0,1 doesn't allow multiple encodings of the same string,
    and in fact you can get exactly the same entropy as with a bitwise model - by estimating the run length probability
    distributions using that bitwise model.
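    For illustration, here is a minimal sketch of that run-length code (an illustration only, not code from bsc or BCM). Each run of zeroes terminated by a '1' becomes one number, so 0001000011101 -> 3,4,0,0,1, and decoding inverts it exactly; a trailing run of zeroes would need an explicit terminator.
    Code:
    // Sketch of the bijective run-length code discussed above: every '1'
    // closes a run, and the emitted number is the count of zeroes before it.
    #include <cstdio>
    #include <string>
    #include <vector>
    
    static std::vector<int> rle_encode(const std::string& bits) {
        std::vector<int> runs;
        int zeros = 0;
        for (char b : bits) {
            if (b == '0') ++zeros;
            else { runs.push_back(zeros); zeros = 0; }  // '1' closes the run
        }
        return runs;        // assumes the string ends with '1' (no open run)
    }
    
    static std::string rle_decode(const std::vector<int>& runs) {
        std::string bits;
        for (int r : runs) { bits.append(r, '0'); bits.push_back('1'); }
        return bits;
    }
    
    int main() {
        std::string s = "0001000011101";
        std::vector<int> runs = rle_encode(s);          // 3 4 0 0 1
        for (int r : runs) std::printf("%d ", r);
        std::printf("\n%s\n", rle_decode(runs) == s ? "roundtrip ok" : "mismatch");
    }
    Since decoding is exact, the run lengths carry the same information as the bits, and their probabilities can be estimated from the same bitwise model.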

  16. #16 - encode (The Founder, Moscow, Russia)
    Quote Originally Posted by Shelwien
    Bsc has 3-ary logistic mixing, SSE and state machine contexts, so you won't win with anything simpler.
    Just out of curiosity: BCM 0.09 has no logistic mixing or state machine contexts, and its result for ENWIK9 with a 1 GB block is 163,012,001 bytes! Anyway, it's still weaker than I expected...

  17. #17 - encode (The Founder, Moscow, Russia)
    As to your RLE example: after those run flags we need to code the exact byte, and if we use bitwise coding we cannot properly exclude the previous symbol (which cannot appear) from coding... To be honest, I tested such an idea - coding became much faster, but I can clearly see the redundancy...

  18. #18 - Shelwien (Administrator, Kharkov, Ukraine)
    You can apply that RLE to bits in bit histories, not to bits in original data order.

  19. #19 - encode (The Founder, Moscow, Russia)
    So it's like keeping a real bit history, plus a special context?

    Like:

    Bit history = 000111101000

    Context = 111011100011

  20. #20 - Piotr Tarsa (Member, Kraków, Poland)
    If you use RLE, LZP or whatever prediction, you can still make the encoding bijective. In the simple case where only one value is predicted at a time (which is true for RLE and LZP, but not for e.g. SR3), if the value to be encoded and the predicted value differ only in the last bit, then you can skip encoding that last bit altogether. Anyway, I think RLE skews the statistics in some unpredictable way, and modelling such output is harder.
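    A small sketch of that last-bit trick (illustrative only; the plain bit vector stands in for a real binary range coder): when the literal is known to differ from the single predicted byte and all higher bits matched the prediction, the final bit is forced and can be skipped.
    Code:
    // Bitwise coding of a literal known to differ from one predicted byte
    // (e.g. after an RLE/LZP mismatch flag). If bits 7..1 all matched the
    // prediction, bit 0 is implied and never coded.
    #include <cstddef>
    #include <cstdint>
    #include <vector>
    
    static void encode_literal(uint8_t value, uint8_t predicted, std::vector<int>& out) {
        bool prefix_matches = true;                // all bits so far equal to prediction?
        for (int i = 7; i >= 0; --i) {
            int bit  = (value     >> i) & 1;
            int pbit = (predicted >> i) & 1;
            if (i == 0 && prefix_matches) break;   // forced bit: skip it
            out.push_back(bit);                    // stand-in for a range coder call
            if (bit != pbit) prefix_matches = false;
        }
    }
    
    static uint8_t decode_literal(uint8_t predicted, const std::vector<int>& in, std::size_t& pos) {
        uint8_t value = 0;
        bool prefix_matches = true;
        for (int i = 7; i >= 0; --i) {
            int pbit = (predicted >> i) & 1;
            int bit;
            if (i == 0 && prefix_matches) bit = pbit ^ 1;  // infer the forced bit
            else { bit = in[pos++]; if (bit != pbit) prefix_matches = false; }
            value = (uint8_t)((value << 1) | bit);
        }
        return value;
    }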

    I have used that idea in my private bitwise compressor (for the contest I wrote about before) and it works reasonably well. But that compressor was tuned for some specific, hardly compressible data, so I wouldn't draw any conclusions from those results. On the other hand, bytewise coders use symbol exclusion with great success - for example my TarsaLZP (which is largely unfinished) or, a much better example, PPM coders like PPMII.

    Btw, bsc compresses enwik8 to 26,790,117 bytes _before_ entropy coding,
    ie its model/rc only has to compress 214,320,941 bits instead of original 800M,
    Does that figure include the RLE flags, or whatever flags or data the RLE stage produces?

    Francesco used a pre-Huffman stage with great success in his Rings compressor. Here's part of Matt's description of the Rings backend (i.e. the stages after the STx):
    The transformed data is encoded using MTF (move to front), pre-Huffman coding followed by arithmetic coding.
    Shelwien told me (on IRC) that pre-Huffman after MTF is a stupid idea, but as we can see, it works! Too bad Rings is closed source. Otherwise it would be possible to also test a variant with forward-MTF (as in QLFC), or even one without MTF (as Shelwien suggested).
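    For reference, the MTF stage of such a pipeline looks like this (the textbook move-to-front transform; Rings is closed source, so this is not its actual code):
    Code:
    // Move-to-front: each byte is replaced by its position in a recency list,
    // then moved to the front. BWT output with many repeats turns into many
    // small values (mostly 0 and 1) for the following coding stages.
    #include <cstdint>
    #include <numeric>
    #include <vector>
    
    static std::vector<uint8_t> mtf_encode(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> table(256);
        std::iota(table.begin(), table.end(), 0);   // recency list: 0,1,...,255
        std::vector<uint8_t> out;
        out.reserve(in.size());
        for (uint8_t c : in) {
            uint8_t rank = 0;
            while (table[rank] != c) ++rank;        // current position of the symbol
            out.push_back(rank);
            table.erase(table.begin() + rank);      // move the symbol to the front
            table.insert(table.begin(), c);
        }
        return out;
    }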


    You can apply that RLE to bits in bit histories, not to bits in original data order.
    What do you mean? Split the original data into 8 streams (i.e. one per bit position) and then do RLE on those streams separately? Where's the sense in that?
    Last edited by Piotr Tarsa; 3rd June 2011 at 13:28.

  21. #21 - Shelwien (Administrator, Kharkov, Ukraine)
    To be specific, I meant something like this:
    Code:
    // order-1 context: previous byte (the 0x100 offset selects a separate table)
    ctx = 0x100 + data[i-1];
    
    for( j=7; j>=0; j-- ) {          // the 8 bit positions of data[i]
      bit = 0;
      if( --run[ctx]==0 ) {          // zero-run in this context exhausted?
        bit = 1;                     // emit the '1' that ends the run
        run[ctx] = NumberModel(ctx); // draw the next run length from the model
      }
      cc += cc + bit;                // accumulate the reconstructed bits
    }
    But obviously it's not the only way.

  22. #22 - Shelwien (Administrator, Kharkov, Ukraine)
    > Does that figure include RLE flags or whatever flags or data that is produced by RLE stage?

    Yes, I just counted binary rangecoder calls in my qlfc rip-off.

    > The transformed data is encoded using MTF (move to front),
    > pre-Huffman coding followed by arithmetic coding.
    > Shelwien told me (on IRC) that pre-Huffman after MTF is a stupid
    > idea, but as we can see, it works!

    Surely it works, it's just that MTF screws up the dependencies much worse
    than RLE does, and RLE is more efficient than huffman for that kind of data
    (with long runs).

    I still think that using an o1-o2 bitcode _before_ BWT, and then doing
    the BWT and postcoding on the precompressed data, would be more efficient
    than applying that bitcode _after_ BWT (but we can't use RLE before BWT).

    > Otherwise it could be possible to also test variant with forward-MTF
    > (as in QLFC) or even without MTF

    It's not hard to test. There are sources for both the optimal ordered bitcode
    and huffman coding (e.g. http://www.ctxmodel.net/files/bitcode_v2.rar),
    and there are sources for MTF and qlfc in bsc or http://nishi.dreamhosters.com/u/bsc240_qlfc_v2.rar

  23. #23 - Piotr Tarsa (Member, Kraków, Poland)
    To be specific, I meant something like this: (blabla some code)
    So, if I'm correct, it's like stably sorting each of those 8 streams using the previous byte in the original stream as a key, and then doing RLE on those 8 sorted streams. And those RLE lengths will be uncorrelated, so we'll have more run lengths to encode. Are you sure it won't make compression worse?

    I still think that using o1-o2 bitcode _before_ BWT, and then doing
    BWT and postcoding of precompressed data would be more efficient,
    than applying that bitcode _after_ BWT (but we can't use RLE before BWT).
    Well, let's use an o1 bitcode before BWT - and then we can't use libdivsufsort. Great idea!

    Why can't we do RLE before BWT? bzip2 does RLE before BWT, and AFAIK it worsens compression :P

    Its not any hard to test.
    I don't have much time to test now (so I'll postpone it), but encoding BWT'd data is very different from encoding normal data. I've tried, for example, o1 coders, fpaq0f2 and similar, and all of them were significantly worse than even the weakest coders specialized for BWT'd data. Francesco probably did some serious fine-tuning in his compressor.

    RLE is more efficient for that kind of data (with long runs)
    than huffman
    Unless you're encoding data where RLE gives less than 1 bit per symbol, I don't think RLE will give significantly smaller sizes. And I wouldn't care so much about files with pre-encoding compression better than 1 bps; compressing such files is usually bounded by the block sorting.

  24. #24 - Shelwien (Administrator, Kharkov, Ukraine)
    > So, if I'm correct, it's like stable sorting every of those 8
    > streams using previous byte in original stream as a key and then
    > doing RLE on those 8 sorted streams.

    Yes, though from my point of view it's RLE on context histories
    in an order-1 context.
    Also, there are 256*255 streams, not 8.

    > And those RLE lengths will be uncorrelated so we'll have more
    > run-lengths to encode.

    More than what, and uncorrelated with what?
    Run lengths surely would have correlations within their contexts,
    and also globally (as I said before, the average run length is determined
    by the data type and BWT block size).

    > Are you sure it won't make compression worse?

    Worse than what? Worse than 208k on book1, probably yes; worse
    than bsc's 212k, probably not.

    > Well, let's use o1 bitcode before BWT and then we can't use libdivsufsort. Great idea!

    I didn't write divsufsort and likely won't ever bother to use it,
    exactly because I can't make it compatible with various tricks and MT.
    But I don't see why it wouldn't be compatible with pre-BWT bitcoding in theory.

    > Why we can't do RLE before BWT?
    > bzip2 does RLE before BWT and it AFAIK worsens compression :P

    We can use it, it just won't do much compression for the enwiks and such.
    What I meant is that the effect of RLE before and after BWT is significantly different,
    while with bitcoding it's the same.
    But bitcoding also improves sorting speed, so why use it after BWT?

    > I don't have much time now to test (so I'll postpone it), but
    > encoding BWT'd data is very different from encoding normal data.
    > I've tried for example o1 coders, fpaq0f2 or similiar and all of
    > them were significantly worse than even weakest encodings
    > specialized for encoding BWT'd

    Dunno what you tried, but all of my recent coders are tuned to book1bwt.
    See http://encode.su/threads/1153-Simple...ll=1#post23197

    > Unless you're encoding data where RLE gives less than 1 bit per
    > symbol, then I don't think RLE will give significantly smaller sizes.

    Sure, but BWT output with large block sizes is exactly that kind of data.

    > I wouldn't care so much about files with pre-encoding compression
    > better than 1 bps. Compressing such files is usually bounded by
    > block-sorting.

    Don't mix up average bpc and local bpc.

  25. #25 - Piotr Tarsa (Member, Kraków, Poland)
    More that what and uncorrelated with what?
    Much more than in the situation where you do RLE on entire bytes.

    But bitcoding also improves sorting speed, so why use it after BWT.
    Currently not. Current SACAs that operate on untouched data (e.g. divsufsort) are much faster than ones that operate on precoded data (e.g. your block-sorting algorithms). Because of that, compressor authors will choose libdivsufsort. There's also another thing to note: Huffman codes should be shorter than optimal ordered bitcodes and probably allow for easier modelling and better compression. And you can't use pre-Huffman before BWT unless you decode the symbols while block-sorting.

    Dunno what you tried, but all of my recent coders are tuned to book1bwt.
    Still, it's not close even to bcm on enwik8.

    Sure, but BWT output with large block sizes is exactly that kind of data.
    If a file is huge (so that many BWT blocks are required because of memory constraints) and extremely redundant, resulting in more than 90% savings after compression, then it's probably better to do an LZP stage before BWTing. LZP worsens compression (because it destroys contexts), but it allows more data to fit into one BWT block (so more correlations can be found), so in the end the compression could be even better (than without LZP, of course).

  26. #26 - Shelwien (Administrator, Kharkov, Ukraine)
    > Much more than in situation where you do RLE on entire bytes.

    I don't think so, in part because "RLE on entire bytes" with masking
    won't be practical (too slow), and without masking it would be redundant.
    Also, RLE on bit histories in a prefix context should be more efficient
    than a bytewise one. For example, for book1 it would immediately cut off
    all the bit7=0 bits and store them as one run. In fact I tested something similar
    (RLE in probability contexts) and it actually improved compression
    (though it was intended as a speed optimization), and there were million-bit
    run(s) even in enwik8bwt.

    > Currently not. Current SACAs that operate on untouched data (eg
    > divsufsort) are much faster than ones that operate on precoded data
    > (eg your block-sorting algorithms).

    If we used an order-0 bitcode, the only difference would be
    bit pointers instead of byte pointers.
    And for order-1+ codes we'd have to include the context bytes
    with the pointers.
    Either way, imho there's no reason why "current SACAs" can't
    be modified to work with compressed data.

    Btw, one interesting possibility for divsufsort is
    "unrolling" - compressing data shifts into 16- or 32-bit "cells"
    (or even 8-bit ones if there are far fewer than 256 symbols in the data).

    For example, let's compare some strings using book1 o0 huffman codes:
    'b' <1111101> 'c' <101110> 'm' <101111>
    'b' <1111101> 's' <0010> 'c' <101110>
    Here we'd normally have to compare 2 bytes, but with huffman codes
    we'd know that bcm > bsc right after the first byte.

    Btw, in this case it's also possible to use arithmetic coding and
    prefix contexts within cells... I think that's what I tried to do
    in http://nishi.dreamhosters.com/u/rcbwt3_0.rar actually.
    In other words, efficient radix sorting with byte pointers.
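    A tiny sketch of that comparison (illustrative only, using the code words quoted above): concatenate the codes and compare the bit strings; they already differ at the 8th bit, i.e. inside the first byte of the packed codes.
    Code:
    // Compare two strings by the bit strings of their prefix codes,
    // using the book1 o0 huffman codes quoted above.
    #include <cstdio>
    #include <map>
    #include <string>
    
    static std::string to_bits(const std::string& s, const std::map<char, std::string>& code) {
        std::string bits;
        for (char c : s) bits += code.at(c);
        return bits;
    }
    
    int main() {
        std::map<char, std::string> code = {
            {'b', "1111101"}, {'c', "101110"}, {'m', "101111"}, {'s', "0010"},
        };
        std::string a = to_bits("bcm", code);   // 1111101 101110 101111
        std::string b = to_bits("bsc", code);   // 1111101 0010   101110
        // The first 7 bits agree; bit 8 is '1' for "bcm" and '0' for "bsc",
        // so the order is decided within the first byte.
        std::printf("bcm %s bsc\n", a > b ? ">" : "<=");
    }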

    > Because of that, compressors authors will choose libdivsufsort.

    That's ok; for now it doesn't look like anybody besides Yuta Mori
    is able to port divsufsort to a GPU, or even to real MT on a CPU.

    > There's also another thing to note: Huffman codes should be shorter
    > than optimal ordered bitcodes and probably allow for easier
    > modelling and better compression. And you can't use pre-Huffman
    > before BWT unless you're decoding the symbols when block-sorting.

    Obviously we can do BWT-of-compressed-data with huffman or any
    other prefix code. I only decided to use these "optimal ordered" codes
    (btw, the dynamic programming algo was suggested by Jan Ondrus, the naming is mine;
    if you know whether/where it's documented, please tell)
    because the compression difference vs. huffman is not so big (5%),
    but with ordered codes we get the "standard" BWT, while with huffman
    the symbol ranks in comparisons would be randomly changed, which normally
    makes compression a little worse (~1k on book1).

    > Still it's not close even to bcm on enwik8.

    What do you mean? The mix(o0,o1) coders? Surely not.
    But mixtest/bwtmix seems ok - http://compressionratings.com/bwt.html#enwik9
    Also see http://compressionratings.com/files/bwtmix-1s.zip
    if you really want Yuta's BWT.

    >> Sure, but BWT output with large block sizes is exactly that kind of data.
    > probably it's better to do LZP stage before BWTing.

    Sure, though I don't see why LZP - something like srep would be better imho
    (obviously you don't need to sort the match info).
    Also, it's completely orthogonal to bitcodes - it's best to use both methods at once.

    > allows more data to fit into one BWT block

    That applies to any kind of compression used before BWT.

  27. #27 - Bulat Ziganshin (Programmer, Uzbekistan)
    The LZP compressed data format is optimized for texts; the REP/SREP format is optimized for binary data.

    BSC first splits the data into blocks, then applies LZP to each block independently. This allows it to fully utilize multi-core CPUs. Using REP/LZP before splitting into blocks may slow down the process as a whole.

  28. #28 - Bulat Ziganshin (Programmer, Uzbekistan)
    with ordered codes we get the "standard" BWT, while with huffman
    the symbol ranks in comparison would be randomly changed
    We can assign huffman codes in the proper order; the only problem is that this needs larger decoding tables.

    I.e. with the usual Huffman we can decode most codes using 8-9 bits, and only a few of them require a secondary table. With "ordered Huffman" even short codes become "unaligned", so we need a full 16-bit decoding table.

    OTOH, it may be decoded as I do in the Tornado RangeCoder - i.e. 8-9 bits are used to find the interval of possible huffman codes, and then a linear or binary search is used to find the exact code.
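    For reference, a minimal canonical-Huffman decode loop in that spirit (an illustrative sketch, not Tornado's actual code): it searches the per-length code intervals one bit at a time; a table-driven decoder would first resolve the top 8-9 bits through a primary table and fall back to such a search only for the long codes.
    Code:
    // Canonical Huffman decoding by searching the interval of codes of each
    // length. count[len] = number of codes of length 'len' (count[0] unused);
    // symbols[] = symbols sorted by (code length, code value).
    #include <cstddef>
    #include <vector>
    
    static int decode_symbol(const std::vector<int>& bits, std::size_t& pos,
                             const std::vector<int>& count,
                             const std::vector<int>& symbols) {
        int code = 0;     // code value read so far
        int first = 0;    // first code value of the current length
        int index = 0;    // index of the first symbol of the current length
        for (std::size_t len = 1; len < count.size(); ++len) {
            code |= bits[pos++];                  // append the next (MSB-first) bit
            int n = count[len];
            if (code - n < first)                 // i.e. code < first + n: length found
                return symbols[index + (code - first)];
            index += n;                           // skip the symbols of this length
            first = (first + n) << 1;             // first code of the next length
            code <<= 1;
        }
        return -1;                                // over-long / invalid code
    }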

  29. #29 - Piotr Tarsa (Member, Kraków, Poland)
    Obviously we can do BWT-of-compressed-data with huffman or any
    other prefix code.
    If we use higher than order-0 prefix codes, then we cannot sort without decoding, I think (i.e. when using Huffman codes or something like that).

    Either way, imho there's no reason why "Current SACAs" can't
    be modified to work with compressed data.

    (....)

    That's ok, for now it doesn't look like anybody besides Yuta Mori
    is able to port divsufsort to GPU or even real MT on CPU.
    Well, I'm developing my own SACA that will be scalable (at least to a few cores) and shouldn't be much slower than divsufsort on a single-core CPU.

    The reason I'm saying that efficient SACAs won't like compressed input is that, taking my algorithm as an example, suffixes are sorted incrementally, i.e. first by the first two symbols, then by three symbols, etc.; after e.g. 50 symbols we can stop incrementing by one and begin doubling the radix (as in qsufsort). The radix-doubling technique requires knowing the length of the first 'radix' symbols of the current suffix, so that equal suffixes (i.e. those with equal 'radix'-length prefixes) can be sorted using the ranks of the following suffixes as keys.
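    A compact sketch of the radix-doubling idea (Manber-Myers style prefix doubling, illustrative only - not QRotSort or qsufsort): suffixes are ranked by their first k symbols, then re-sorted by pairs of ranks, so the sorted prefix length doubles each pass.
    Code:
    // Prefix doubling: after each pass the suffixes are sorted by their first
    // 2^p symbols; a pass sorts by (rank[i], rank[i + k]) and re-ranks.
    // O(n log^2 n) with std::sort; production SACAs are far more refined.
    #include <algorithm>
    #include <string>
    #include <vector>
    
    static std::vector<int> suffix_array(const std::string& s) {
        int n = (int)s.size();
        std::vector<int> sa(n), rank(n), tmp(n);
        if (n == 0) return sa;
        for (int i = 0; i < n; ++i) { sa[i] = i; rank[i] = (unsigned char)s[i]; }
        for (int k = 1; ; k <<= 1) {
            auto cmp = [&](int a, int b) {          // compare by (rank, rank at +k)
                if (rank[a] != rank[b]) return rank[a] < rank[b];
                int ra = a + k < n ? rank[a + k] : -1;
                int rb = b + k < n ? rank[b + k] : -1;
                return ra < rb;
            };
            std::sort(sa.begin(), sa.end(), cmp);
            tmp[sa[0]] = 0;                         // assign new ranks for the next pass
            for (int i = 1; i < n; ++i)
                tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
            rank = tmp;
            if (rank[sa[n - 1]] == n - 1) break;    // all ranks distinct: fully sorted
        }
        return sa;
    }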

    Generally, fast SACAs try to reduce memory accesses to a minimum; the extreme (of the ones I know) is qsufsort, which does an alphabet transformation at the start (i.e. it tries to stuff as many symbols as possible into 4-byte fields, where each symbol takes the same amount of space), then does a bucket sort which transforms that into SA and ISA, and then operates only on SA and ISA without touching the input array at all.

    SACAs are mostly bottlenecked by random memory access (qsufsort, for example, has a very random memory access pattern, which is why it's so slow compared to other SACAs), but the CPU loads entire cache lines at once: while loading a cache line from memory can take tens or hundreds of cycles, scanning a cache line is essentially latency-free.

    Also, I don't think parallelizing Yuta's divsufsort would be easy. Step 3 and Step 4 of his Improved Two-Stage technique seem unparallelizable to me. The description is here: http://homepage3.nifty.com/wpage/software/itssort.txt . As a side note, I've improved Seward's Copy technique, which seems relatively scalable; here are the results:
    Code:
                       Copy ascending      Copy simple-ratio   Improved two-stage
    chr22.dna          35.56               31.67               26.68
    etext99            49.75               38.64               31.10
    gcc-3.0.tar        42.63               33.46               26.77
    howto              45.51               36.13               28.55
    jdk13c             47.63               35.05               30.06
    linux-2.4.5.tar    44.12               35.20               26.80
    rctail96           49.00               36.57               30.28
    rfc                40.17               32.28               25.83
    sprot34.dat        43.14               38.07               29.26
    w3c2               47.46               36.22               29.90
    
    mean               44.50               35.33               28.52
    Copy ascending is the usual Copy implementation, where buckets are sorted in order of increasing size.
    Copy simple-ratio is my variant of the Copy technique; I used a simple heuristic to find the sorting order - maybe with something smarter I would be able to shave off another few percent and come really close to ITS.
    Improved two-stage is the technique used in divsufsort and msufsort.

    I have lots of ideas on how to improve my QRotSort algorithm, so I'll postpone my master's thesis to the end of the holidays so that I have time to implement them.

    BSC first splits data into blocks, then applies LZP to each block independently. this allows to 100% utilize multi-core cpus. using REP/LZP before splitting into blocks may slowdown process as a whole
    Maybe, maybe not. I'm pretty sure the LZP stage is an order of magnitude faster than the BWT + QLFC stage, so one thread could do the LZP and the other threads/cores could do the further stages.
    Last edited by Piotr Tarsa; 3rd June 2011 at 20:08.

  30. #30 - Shelwien (Administrator, Kharkov, Ukraine)
    > LZP compressed data format is optimized for texts,
    > REP/SREP format is optimized for binary data

    I meant that LZP is relatively slow and symmetric - it
    basically has to do a hash lookup on every data byte.
    So rep/srep and (especially) my anchor hashing should
    be much faster for this, and this stage only has to
    remove very long matches, so imho it doesn't matter
    that it's less precise than LZP.

    > BSC first splits data into blocks, then applies LZP to each block
    > independently. this allows to 100% utilize multi-core cpus. using
    > REP/LZP before splitting into blocks may slowdown process as a whole

    I don't see why; it should normally be a stream anyway, so it doesn't
    matter how many stages the pipeline has.
    Gribok did mention reordering of written blocks, but that's
    basically an encoding-time optimization at the expense of decoding time,
    so it's not likely a good idea.

    >> with ordered codes we get the "standard" BWT, while with huffman
    >> the symbol ranks in comparison would be randomly changed

    > we can assign huffman codes in the proper order, the only problem is
    > that this needs larger decoding tables

    I think you're missing something.
    I was talking about compressing the data before BWT.
    Basically, if we compress the data with a static bitcode and
    compare the compressed strings, it's possible to compute the same transform,
    but compare shorter strings on average.
    But the sort order of huffman codes is different from the order
    of the original symbol codes, and such a "random" sort order usually
    hurts compression a little, so it's worth considering a
    different kind of prefix code, one that preserves the
    original symbols' order. Incidentally, such a code can be
    built with a dynamic programming algo (unlike a huffman code),
    which is why I'm calling it "optimal ordered".
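    As an illustration of that dynamic programming idea (a generic sketch; the actual algorithm suggested by Jan Ondrus may differ): an order-preserving prefix code corresponds to a binary tree whose leaves keep the symbols in alphabet order, and the minimum total code length of an interval of symbols is found by trying every split point.
    Code:
    // Minimum total coded size (in bits) of an order-preserving ("alphabetic")
    // prefix code, by interval DP: cost[i][j] = best cost of a code tree over
    // symbols i..j; splitting the interval adds one bit to every symbol in it,
    // i.e. the interval's total frequency. O(n^3) for clarity.
    #include <algorithm>
    #include <cstdint>
    #include <vector>
    
    static uint64_t ordered_code_cost(const std::vector<uint64_t>& freq) {
        int n = (int)freq.size();
        if (n == 0) return 0;
        std::vector<uint64_t> prefix(n + 1, 0);
        for (int i = 0; i < n; ++i) prefix[i + 1] = prefix[i] + freq[i];
        auto weight = [&](int i, int j) { return prefix[j + 1] - prefix[i]; };
    
        std::vector<std::vector<uint64_t>> cost(n, std::vector<uint64_t>(n, 0));
        for (int len = 2; len <= n; ++len)           // interval length
            for (int i = 0; i + len - 1 < n; ++i) {
                int j = i + len - 1;
                uint64_t best = UINT64_MAX;
                for (int k = i; k < j; ++k)          // left = i..k, right = k+1..j
                    best = std::min(best, cost[i][k] + cost[k + 1][j]);
                cost[i][j] = best + weight(i, j);    // each symbol gets one more bit
            }
        return cost[0][n - 1];
    }
    The Hu-Tucker algorithm solves the same problem in O(n log n), but the DP above is enough to show why an ordered code, unlike a Huffman code, can be built this way.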
