
Thread: BWT - how to use?

  1. #1
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts

    BWT - how to use?

    I tried BWT with several files and several compressors, and it always hurts, usually a lot.
    Does it work well only with certain types of files? Or does it need a specific compressor design? What's up?

  2. #2
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    BWT is not compression, and it is usually only helpful on long files, since the most common BWT stages in a compressor lengthen the file. Only BWTS does the transform with no additional space in the file, even if the file is so long that you need several BWTSed blocks.
    Once you have done the transform, you need some follow-on stages like MTF with an RLE or distance coding, followed by a simple entropy coder. If you search this site you can find test results for various follow-on stages to BWT or BWTS.
    One advantage of BWTS is that you can also UNBWTS any file. In fact I have tested several files plain vanilla, then after UNBWTS, and then after BWTS. For most file compressors the worst compression is on the UNBWTSed file, then the plain file, and the best is usually on the BWTSed file. However, the interesting fact is that if you use a true first-order entropy compressor like arb255, then any file is compressed to the same length +/- one byte, no matter how many times it is preprocessed with BWTS or UNBWTS.
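
    For illustration, a minimal C++ sketch of the kind of MTF stage meant here (the function name and byte-oriented interface are just illustrative; an RLE pass and a simple entropy coder would follow it):

    Code:
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Move-to-front: after a BWT, locally repeated symbols turn into runs of
    // zeros and small values, which a following RLE + entropy coder exploits.
    std::vector<uint8_t> mtf_encode(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> table(256);
        std::iota(table.begin(), table.end(), 0);   // table[i] = i
        std::vector<uint8_t> out;
        out.reserve(in.size());
        for (uint8_t c : in) {
            size_t rank = 0;
            while (table[rank] != c) ++rank;        // current rank of c
            out.push_back(static_cast<uint8_t>(rank));
            table.erase(table.begin() + rank);      // move c to the front
            table.insert(table.begin(), c);
        }
        return out;
    }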

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    735
    Thanked 660 Times in 354 Posts
    bwt is great for texts, especially natural-language ones. with html/sources, it's better to add an LZP preprocessor first

  4. #4
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by biject.bwts View Post
    BWT is not compression
    Yep, I know.
    Quote Originally Posted by biject.bwts View Post
    (...) and it is usually only helpful on long files, since the most common BWT stages in a compressor lengthen the file. Only BWTS does the transform with no additional space in the file, even if the file is so long that you need several BWTSed blocks.
    I took a look at your site and I don't see any download there... Is there any publicly available compilation?

    Quote Originally Posted by biject.bwts View Post
    Once you have done the transform, you need some follow-on stages like MTF with an RLE or distance coding, followed by a simple entropy coder. If you search this site you can find test results for various follow-on stages to BWT or BWTS.
    It seems I've been using coders that are too strong, like FreeArc -m9, ccm, PPMDh.
    When I try weaker ones, BWT indeed helps.
    I tested with both MTF and RLE and the improvements are questionable; they sometimes help, sometimes hurt... Maybe it's my test data, but after a brief look they don't seem to be worth the time.

    Quote Originally Posted by Bulat Ziganshin View Post
    bwt is great for texts, especially natural-language ones. with html/sources, it's better to add LZP preprocessor before
    Thanks. I tried some texts before (Shelwien's booktar), but it hurt almost as badly. I've concentrated more on texts now, and with weaker coders it's OK.
    Thank you both for the answers.

  5. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    735
    Thanked 660 Times in 354 Posts
    trying to use conventional archivers after a BWT transformation is meaningless, of course. if you want to write your own BWT-based compressor, you need to look into the specialized literature (which mostly consists of various thesis papers)

  6. #6
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    trying to use conventional archivers after a BWT transformation is meaningless, of course. if you want to write your own BWT-based compressor, you need to look into the specialized literature (which mostly consists of various thesis papers)
    I don't intend to implement my own compressor yet; for now I'm just stumbling around to see what works and what doesn't.

  7. #7
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't know exactly what you did, but BWT's output is the sequence of symbols encountered under each possible context, with the contexts lexically sorted. So similar contexts are grouped together, producing mostly uniform symbol sequences.
    It's not about "weaker" and "stronger" compressors, it's about what data they are made for. BWT output needs special models.

    LZP for BWT preprocessing isn't a good idea, since it inserts artificial symbols which share no lexical order with the normal symbols, thus distorting the context grouping done by the sort, e.g.

    t h e b r o w n f o x -> LZP -> (3) b r o w n (3)

    So the LZP symbol "3" maps both to "the" and to "fox" and loses any lexical meaning, whereas a string substitution that assigns fixed codes to common strings doesn't lose that relation: the code 3 always maps to "the" and the code 4 always to "fox".
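
    To make the context grouping concrete, here is a minimal, quadratic, illustration-only BWT that sorts all rotations of a short string and outputs the last column (real implementations use suffix sorting, and also store a primary index or use BWTS to make the transform reversible):

    Code:
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Naive BWT for illustration: sort all rotations, output the last column.
    // Symbols that occur under lexically adjacent contexts end up next to
    // each other, which is exactly what the follow-on MTF/RLE stages exploit.
    std::string naive_bwt(const std::string& s) {
        const size_t n = s.size();
        std::vector<size_t> rot(n);
        for (size_t i = 0; i < n; ++i) rot[i] = i;
        std::sort(rot.begin(), rot.end(), [&](size_t a, size_t b) {
            for (size_t k = 0; k < n; ++k) {              // compare rotations
                char ca = s[(a + k) % n], cb = s[(b + k) % n];
                if (ca != cb) return ca < cb;
            }
            return false;                                  // equal rotations
        });
        std::string last(n, ' ');
        for (size_t i = 0; i < n; ++i)
            last[i] = s[(rot[i] + n - 1) % n];             // last column
        return last;
    }

    int main() {
        std::cout << naive_bwt("the brown fox bit the red fox") << "\n";
    }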

  8. #8
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    At the same time, LZP may improve compression if we deal with long and distant matches on specific files. Note that we should carefully choose the MINMATCH - it should be long enough. Anyway, with binary files LZP should help a lot. Here LZP acts as a kind of replacement for segmentation.
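
    As a rough sketch of what such an LZP filter does (the order-3 hash, the token format and the default MINMATCH below are illustrative choices, not the actual filter discussed here):

    Code:
    #include <cstdint>
    #include <vector>

    // Tiny LZP sketch: predict the current position from the last position seen
    // under the same order-3 context; replace matches of >= min_match bytes by a
    // {flag, length} token, otherwise emit a {flag, literal} token.  A real coder
    // packs flags and lengths far more tightly.
    std::vector<uint8_t> lzp_encode(const std::vector<uint8_t>& in,
                                    size_t min_match = 32) {   // a "long enough" MINMATCH
        std::vector<uint32_t> last_pos(1 << 16, 0);
        std::vector<uint8_t> out;
        size_t i = 0;
        while (i < in.size()) {
            size_t len = 0;
            if (i >= 3) {
                uint32_t ctx = in[i - 1] | (in[i - 2] << 8) | (in[i - 3] << 16);
                uint32_t h = (ctx * 2654435761u) >> 16;        // order-3 context hash
                size_t pred = last_pos[h];
                last_pos[h] = static_cast<uint32_t>(i);
                while (pred && i + len < in.size() &&
                       in[pred + len] == in[i + len] && len < 255)
                    ++len;
            }
            if (len >= min_match) {
                out.push_back(1);                              // match flag
                out.push_back(static_cast<uint8_t>(len));
                i += len;
            } else {
                out.push_back(0);                              // literal flag
                out.push_back(in[i]);
                ++i;
            }
        }
        return out;
    }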

  9. #9
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    How can LZP be a replacement for segmentation?

    Increasing the threshold for LZP's match length simply lowers the distortion I mentioned. This is still just an improper hack, which doesn't mean it *cannot* help.

    But a better BWT output model will improve compression... I wonder why nobody has done something like this: convert the BWT output to variable-length codes and encode those, like Huffman, but some unary model might do a better job.

    Anyway, did you ever try to compare WRT vs LZP on a text file using BWT?

    Applying n-gram models to x86 data is wrong anyway, since the data's alignment doesn't match the model's assumptions; gap models will do better, for example. You can directly compare these context masks:
    0xffff, 0xffffff vs 0xffff, 0xffff00 (written down from memory; optimization helps).

  10. #10
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Quote Originally Posted by toffer View Post
    How can LZP be a replacement for segmentation?

    Increasing the threshold for LZP's match length simply lowers the distortion I mentioned. This is still just an improper hack, which doesn't mean it *cannot* help.
    ...kind of replacement...
    I play a lot with modeling, segmentation and LZP filtering in my BWT compressor. Well, a proper LZP really helps on binary files - EXE or even calgary.tar - even with a super CM model. So a proper transformation like segmentation or an LZP filter really works... It's just something I've found during testing - no theory, just rough tests...

  11. #11
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I don't know what you call a "super CM" model.

    But most people think that some SSE stages along with different models are magic. It's like it was with NNs in their very popular phase: everything was done using NNs, even things which could be done better another way
    ...

    Did you try to compare WRT vs LZP? I bet you didn't. LZP remains a hack here...

  12. #12
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by m^2 View Post
    Yep, I know.

    I took a look at your site and I don't see any download there... Is there any publicly available compilation?



    It seems I've been using coders that are too strong, like FreeArc -m9, ccm, PPMDh.
    When I try weaker ones, BWT indeed helps.
    I tested with both MTF and RLE and the improvements are questionable; they sometimes help, sometimes hurt... Maybe it's my test data, but after a brief look they don't seem to be worth the time.


    Thanks. I tried some texts before (Shelwien's booktar), but it hurt almost as badly. I've concentrated more on texts now, and with weaker coders it's OK.
    Thank you both for the answers.
    Well, my stuff usually comes as several separate stages; the files are all on the site. If you search this site there is a library called OpenBWT; it has code for the BWTS transform as well as several others.

    The code as well as the executables are in the zip files on my site. There is a lot of stuff there, so look harder.

    Shelwien tested BWT, BWTS and others with several end processors; again, search this site.

    I originally tried to follow the style of Mark Nelson, which is why I used several separate processors. Anyway, good luck. With my new president I'm not sure how much longer my site will even exist.

  13. #13
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,979
    Thanks
    376
    Thanked 347 Times in 137 Posts
    Quote Originally Posted by toffer View Post
    Did you try to compare WRT vs LZP? I bet you didn't. LZP remains a hack here...
    Yep, dealing with text files, of course. BWT is really OK with text files, and using dictionaries may really help, but...
    BWT has its worst performance on binary data - that's why we should have something to fix that behavior...

  14. #14
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    It has worse performance on binary data since BWT is made for correlations along continuous order-N contexts, which have a lexical order. Executable data doesn't have that property; I already mentioned this (gappy contexts, etc...)

    EDIT: binary data = x86 executables.
    Last edited by toffer; 5th November 2008 at 15:54.

  15. #15
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Quote Originally Posted by encode View Post
    Yep, dealing with text files, of course. BWT is really OK with text files, and using dictionaries may really help, but...
    BWT has its worst performance on binary data - that's why we should have something to fix that behavior...
    I'm not sure what you mean by BWT doing badly on binary data. That may be true when using the classical 256-symbol BWT, but when you use a binary-based BWT it is sometimes a good thing, no?

  16. #16
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Bijective, I have a bug report for you:
    BWTS hangs on the attached file.
    Attached Files
    • File Type: 7z b.7z (296.5 KB, 323 views)

  17. #17
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    I don't know what you did, but I ran BWTS then UNBWTS and got the same file back.
    Then I did UNBWTS followed by BWTS and again got the same file back, with no hang-up at all. So I can't recreate the problem you seem to be having.

  18. #18
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Ok, so here is more precise info:
    1. I compiled your sources; the executables you provided were too big...
    ...but I couldn't use them anyway, because Windows XP x64 doesn't support 16-bit code.
    The gcc output is in the attachment.
    2.
    Code:
    BWTS bug.exe _
    Sorting the first block is quick, about 2 s. After 20 minutes the second block is still not sorted; CPU usage is constantly 100%.
    Attached Files

  19. #19
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Look, I am no expert on what you're doing. You talk about more than one block; I did it in one block. I currently use the DJGPP version of C for my code. It may not work on your machine. I can't debug your machine unless you're here with me. I suspect others could help if they wanted to. But you can assign less memory to the blocks and see if it runs. All I can say is it works for me. Try using the OpenBWT library Yuta Mori wrote, which includes his BWTS. Just remember his sample code stores the block size in front of the file, so it's not fully bijective unless you slightly modify his code. But it's there for anyone to use.

  20. #20
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    @biject
    What he said basically means that your bwts source is not portable to 64-bit systems.

  21. #21
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by biject.bwts View Post
    Look, I am no expert on what you're doing. You talk about more than one block; I did it in one block. I currently use the DJGPP version of C for my code. It may not work on your machine. I can't debug your machine unless you're here with me. I suspect others could help if they wanted to. But you can assign less memory to the blocks and see if it runs. All I can say is it works for me. Try using the OpenBWT library Yuta Mori wrote, which includes his BWTS. Just remember his sample code stores the block size in front of the file, so it's not fully bijective unless you slightly modify his code. But it's there for anyone to use.

    I tried a 20k block, just like the 16-bit version, and it works. Then I tried to vary the size: 250K is slow, 300K is VERY slow, though I didn't measure the time. I guess it has something to do with complexity.
    I have to say that I don't understand your code, but I eliminated one possible culprit: I replaced qsort with std::sort, which is worst-case n log n. It is even slower (still running).
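
    For illustration, here is roughly why a plain comparison sort blows up on this kind of data (a sketch, not the actual code from the site): with a naive suffix comparator, each comparison of two suffixes of a highly repetitive block costs up to O(n), so the whole sort degrades toward O(n^2 log n) whether qsort or std::sort drives it.

    Code:
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Why long runs hurt a plain sort: comparing two suffixes of a block that is
    // mostly one repeated value only terminates where the runs end, so each
    // comparison costs O(n) and the sort is roughly O(n^2 log n) overall.
    // Dedicated suffix-array construction (e.g. what OpenBWT uses) avoids this.
    std::vector<uint32_t> naive_suffix_sort(const std::vector<uint8_t>& block) {
        const size_t n = block.size();
        std::vector<uint32_t> sa(n);
        for (size_t i = 0; i < n; ++i) sa[i] = static_cast<uint32_t>(i);
        std::sort(sa.begin(), sa.end(), [&](uint32_t a, uint32_t b) {
            size_t x = a, y = b;
            while (x < n && y < n && block[x] == block[y]) { ++x; ++y; }
            if (x == n) return true;    // suffix a is a prefix of suffix b
            if (y == n) return false;
            return block[x] < block[y];
        });
        return sa;
    }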

    Quote Originally Posted by Shelwien View Post
    @biject
    What he said basically means that your bwts source is not portable to 64-bit systems.
    No, it is portable. But it has a problem that doesn't show up in 16-bit code, because there INT_MAX is too small for it to matter.
    Last edited by m^2; 6th November 2008 at 00:10.

  22. #22
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Well, I am using an AMD 64-bit machine. I have had trouble with code people have given me running on my machine too. I think the GCC compiler uses 64-bit integers in my arb255 code. I am not sure what in BWTS could be giving him a problem; it's actually pretty straightforward, not like a lot of my other code. Also, I suggested he try Yuta's OpenBWT code; maybe it would work on his machine. He could check how big an integer his machine handles; maybe he is using options where something he treats as 16 bits is really 32 or more bits, in which case it may just be his compiler.

  23. #23
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by biject.bwts View Post
    Well, I am using an AMD 64-bit machine. I have had trouble with code people have given me running on my machine too. I think the GCC compiler uses 64-bit integers in my arb255 code. I am not sure what in BWTS could be giving him a problem; it's actually pretty straightforward, not like a lot of my other code. Also, I suggested he try Yuta's OpenBWT code; maybe it would work on his machine. He could check how big an integer his machine handles; maybe he is using options where something he treats as 16 bits is really 32 or more bits, in which case it may just be his compiler.
    There's no 64-bit GCC for Windows; I use the 32-bit one.
    I'll try OPENBWT, but not today.

  24. #24
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Look, it's possible your configuration does not handle large memory very well, since using large blocks requires a lot of memory. The code is test code; it's not optimized to be memory-conservative. I just want code that works. I suspect that if BWTS becomes useful, others will find a better way to speed it up, like what has been done for standard BWT. It may be possible to break it up like in Matt's code, where the whole block is not needed in memory at the same time.

  25. #25
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by biject.bwts View Post
    Look, it's possible your configuration does not handle large memory very well, since using large blocks requires a lot of memory.
    I checked this already. It needs 5 MB for a 300K block.
    Cache misses are a big problem for my CPU (Pentium D), but not that big.
    Did you calculate the complexity of this transform?

  26. #26
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    The file that you sent took less than 2 seconds in either direction, so no, I did not measure any times; it was fast on my machine. But you stated it hung up completely on the file you used. I really have no clue as to why it does not finish on yours. When developing it there was tuning that changed the timing on big files; that could make a difference on some machines. I think I posted some of that on comp.compression. You stated you would try OpenBWT; his method is different but should get the same answers. I suspect his code is much faster than mine.

  27. #27
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by biject.bwts View Post
    The file that you sent took less than 2 seconds in either direction, so no, I did not measure any times; it was fast on my machine. But you stated it hung up completely on the file you used. I really have no clue as to why it does not finish on yours. When developing it there was tuning that changed the timing on big files; that could make a difference on some machines. I think I posted some of that on comp.compression. You stated you would try OpenBWT; his method is different but should get the same answers. I suspect his code is much faster than mine.
    With 20K block size?
    Here it's fast too.
    200K takes 10 seconds.
    300K: 130 seconds.

    UNBWTS is fast, no problems here.

  28. #28
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Ok, I finally see what you're doing. I tested the file you sent me with no problem, but when I decompress what you sent, one gets a file which is the one that can be slow, depending on which version of BWTS is used. I changed to various block sizes with the code that I currently use; it can be very slow. I could speed it up; it has to do with how I parse the blocks. The code works but it's slow. I suspect you have large stretches of repeated zeros or some other constant value; I could check for that, but I don't - maybe in the next version.
    I will save the test file for future use, thanks. But it does work. You could use an RLE like Burrows and Wheeler did when the first BWTs came out. Again it's slow for this special case, but it still works.
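
    As a sketch of that idea, a simple byte-level RLE pre-pass (the escape format below is an illustrative choice, not the exact scheme from the original Burrows-Wheeler report):

    Code:
    #include <cstdint>
    #include <vector>

    // Run-length pre-pass: any run of 4 or more identical bytes is written as the
    // byte repeated 4 times followed by one length byte (0..255) for the rest of
    // the run.  This collapses the long stretches of a constant value that make
    // the naive block sort so slow, and is trivially reversible.
    std::vector<uint8_t> rle_encode(const std::vector<uint8_t>& in) {
        std::vector<uint8_t> out;
        size_t i = 0;
        while (i < in.size()) {
            size_t run = 1;
            while (i + run < in.size() && in[i + run] == in[i] && run < 4 + 255)
                ++run;
            size_t emit = run < 4 ? run : 4;
            out.insert(out.end(), emit, in[i]);                // literal part of the run
            if (run >= 4)
                out.push_back(static_cast<uint8_t>(run - 4));  // extra length
            i += run;
        }
        return out;
    }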

  29. #29
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by biject.bwts View Post
    Ok, I finally see what you're doing. I tested the file you sent me with no problem, but when I decompress what you sent, one gets a file which is the one that can be slow, depending on which version of BWTS is used. I changed to various block sizes with the code that I currently use; it can be very slow. I could speed it up; it has to do with how I parse the blocks. The code works but it's slow. I suspect you have large stretches of repeated zeros or some other constant value; I could check for that, but I don't - maybe in the next version.
    I will save the test file for future use, thanks. But it does work. You could use an RLE like Burrows and Wheeler did when the first BWTs came out. Again it's slow for this special case, but it still works.
    OK, good that you figured it out.

  30. #30
    Member biject.bwts's Avatar
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    BWTS and UNBWTS are methods where I doubt you will ever get comparable speeds. Unless someone finds a shortcut, one may be stuck with the fact that the forward transform is much slower. I suspect that's good in the sense that you only have to compress a file once, but if there are many users there can be many decompressions of the same file. I think it's the same with BWT and UNBWT.

    However, if you are encrypting, I suspect you want the slow function to be on the decryption side, which is why I say to use UNBWTS as part of the encryption process; then BWTS, which is slower, is used in the decryption part, where one wants analysis to be harder. Sadly, there is no equivalent way to do this with the old BWT, since few files can be UNBWTed while every file can be UNBWTSed.

    Saving the file pointer, which is not much space, was not my only reason for coding BWTS; for years I wanted to be able to UNBWTS any file. I think I have found it. Check Yuta's code if you want more speed.
