Thread: Question about BWTS

  1. #1
    Member (Kraków, Poland)

    Question about BWTS

    I have a question for David, but I want to make it public:

    Are there cases when BWTS could provide significantly better compression than ordinary BWT? I am wondering if BWTS could help in the case of e.g. a sorted dictionary. On http://www.maximumcompression.com/data/dict.php BWT-based compressors (not STx-based or LZP-augmented ones) perform poorly. What is the average number of Lyndon words after factorization?
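
    For reference, the Lyndon factorization can be computed in linear time with Duval's algorithm; here is a minimal sketch (my own illustration, not code from the paper):

    #include <string>
    #include <vector>

    // Duval's algorithm: factor s into its non-increasing sequence of
    // Lyndon words. BWTS sorts rotations of these factors instead of
    // rotations of the whole string, so their count and lengths matter.
    std::vector<std::string> lyndon_factorize(const std::string& s) {
        std::vector<std::string> factors;
        size_t n = s.size(), i = 0;
        while (i < n) {
            size_t j = i + 1, k = i;
            while (j < n && s[k] <= s[j]) {
                if (s[k] < s[j]) k = i;  // still inside one longer Lyndon word
                else ++k;                // repeating period of the current word
                ++j;
            }
            while (i <= k) {             // emit one factor per full period
                factors.push_back(s.substr(i, j - k));
                i += j - k;
            }
        }
        return factors;
    }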

    Additionally, I think that in the paper http://bijective.dogma.net/00yyy.pdf, on page 6, Algorithm 3.2 Match(n), there is an error in line 8. Currently the loop at lines 7-8 does nothing, as the array before contains only zeroes. I think line 8 should be: before[c] = before[c - 1] + counts[c - 1]
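
    In other words, line 8 should build the usual prefix sums over the symbol counts, as in the counting-sort step of an inverse BWT. A sketch of the corrected loop, using the paper's array names:

    // Corrected lines 7-8: running sum over the per-symbol counts, so that
    // before[c] = number of symbols in the input strictly smaller than c.
    void build_before(const int counts[256], int before[256]) {
        before[0] = 0;
        for (int c = 1; c < 256; ++c)
            before[c] = before[c - 1] + counts[c - 1];
    }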

  2. #2
    Administrator Shelwien (Kharkov, Ukraine)
    > Are there cases when BWTS could provide significantly better compression than ordinary BWT?

    Normally it shouldn't. Afaik the difference is on the level of BWT with and without EOF - i.e. very small.

    > I am wondering if BWTS could help in the case of e.g. a sorted dictionary.

    I did some experiments like that long ago.

    http://compression.ru/sh/dictpack2.rar
    See dk_p2 there - it's a BWT with the comparison function modified in such a way
    that it stops on LFs; dk_p2d is the inverse transform for that.
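
    Roughly, the idea is something like this (a sketch of the comparison only, not the actual dk_p2 source):

    #include <cstddef>

    // Compare the contexts preceding positions i and j, walking backwards
    // and stopping at the first LF, so the sort never crosses the line
    // boundaries of a sorted dictionary.
    const unsigned char* buf;   // input data (assumed set up by the caller)

    int context_cmp_lf(size_t i, size_t j) {
        while (i > 0 && j > 0) {
            unsigned char a = buf[i - 1], b = buf[j - 1];
            if (a != b) return a < b ? -1 : 1;
            if (a == '\n') return 0;   // both contexts end at an LF: equal
            --i; --j;
        }
        return (i > j) - (i < j);      // shorter context sorts first
    }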

  3. #3
    Member (Kraków, Poland)
    Thanks.

    I've looked at the source code. It's surprisingly simple. I'm amazed that this works ;p Probably it uses similar techniques to the BWTS presented by David Scott. Too bad I haven't understood the paper yet.

    I see you use preceding contexts. Did you try following contexts? I think preceding contexts should provide better compression but maybe the truth is different.

    What is "english_lf"? Did you convert the file english.dic, eg deleted CR values or something? Won't your programs work with files with CR+LF line endings?

    Why are there two encoders, p1 and p2, but only one decoder, p2d? Is only the second one reversible?

    Also, could you provide a link to some program that implements an efficient second-stage algorithm like Weighted Frequency Coding and runs on Linux? Currently I'm using fpaq0f2 as the second-stage algorithm, but it's inferior to specialized ones.


    And I have a thought: maybe some custom factorizations, different from those presented in BWTS or dictpack, would help on other types of files?

  4. #4
    Administrator Shelwien (Kharkov, Ukraine)
    > I've looked at the source code. It's surprisingly simple. I'm amazed
    > that this works ;p

    For dictionaries there are actually preprocessing methods which are
    more helpful for compression improvement, e.g.
    http://nishi.dreamhosters.com/u/dictpack3b.rar

    > Probably it uses similar techniques to the
    > BWTS presented by David Scott.

    I don't think so. It's more like a variable-length ST.

    > Too bad I haven't understood the paper yet.

    Afaik BWTS is all about getting rid of the BWT index,
    i.e. making the transform completely bijective.
    The standard BWT would produce the same output for all
    string shifts, so the question is basically about how
    to detect which strings can't be a BWT of some data.
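
    For example, a plain cyclic BWT gives the same output for every rotation of "banana" (a quick illustration, not code from any of the links above):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Plain cyclic BWT: last column of the sorted rotations, no index stored.
    std::string bwt(const std::string& s) {
        size_t n = s.size();
        std::vector<size_t> idx(n);
        for (size_t i = 0; i < n; ++i) idx[i] = i;
        std::sort(idx.begin(), idx.end(), [&](size_t a, size_t b) {
            for (size_t k = 0; k < n; ++k) {
                char ca = s[(a + k) % n], cb = s[(b + k) % n];
                if (ca != cb) return ca < cb;
            }
            return false;
        });
        std::string out(n, 0);
        for (size_t i = 0; i < n; ++i) out[i] = s[(idx[i] + n - 1) % n];
        return out;
    }

    int main() {
        std::string s = "banana";
        for (size_t r = 0; r < s.size(); ++r)      // every rotation of s
            std::cout << bwt(s.substr(r) + s.substr(0, r)) << '\n';
        // prints "nnbaaa" six times - the transform alone can't tell the
        // rotations apart, hence the extra primary index.
    }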

    > I see you use preceding contexts. Did you try following contexts? I
    > think preceding contexts should provide better compression but maybe
    > the truth is different.

    There's a "turn" utility in dictpack3b which you can use for testing.
    Also see a simple BWT+postcoder in http://ctxmodel.net/files/mix_test/mix_test_v9.rar

    > What is "english_lf"? Did you convert the file english.dic, eg
    > deleted CR values or something?

    Yes. Some of the more complex filters in the dictpack experiments required that.

    > Won't your programs work with files with CR+LF line endings?

    dk_p2 should work, I think.

    > Why are there two encoders, p1 and p2, but only one decoder, p2d?
    > Is only the second one reversible?

    I think p1 is just normal BWT.
    There was an alternative implementation, where unique symbol combinations
    were added after LF, so that normal BWT could produce the same output
    as the modified one.
    I guess I just didn't bother to provide an unBWT for it.

    > Also, could you provide a link to some program that implements an
    > efficient second-stage algorithm like Weighted Frequency Coding and
    > runs on Linux?

    I'd suggest mixtest_v9, it's probably the best postcoder atm
    (compression-wise; it's not very fast though).
    Unfortunately gcc has some template issues, so you probably
    won't be able to compile it easily.
    But there's a "fixed" BWTmix:
    http://ctxmodel.net/files/mix_test/B..._gcc_patch.rar
    which you can use instead - just comment out the qsort line there
    and it will be a plain postprocessor.

    Alternatively I can suggest my qlfc rip
    (seems that v2 won't work due to the same template issues)
    http://nishi.dreamhosters.com/u/bsc240_qlfc_v1.rar

    Also afair openbwt included some postcoders:
    http://encode.su/threads/104-libBWT?...ll=1#post22903

    > And I have a thought: maybe some custom factorizations, different
    > from those presented in BWTS or dictpack, would help on other types
    > of files?

    Likely only for files with somehow sorted contents.
    But there are lots of preprocessing types which would improve the results
    of any compression method.

  5. #5
    Member biject.bwts (Texas)
    Quote Originally Posted by Piotr Tarsa (post #1, quoted in full)
    I am sure there are cases where BWTS can provide significantly better compression than ordinary BWT, but that depends on what you mean by the word "significantly". If one has many short files or data sets, it might work better there.
    As far as the paper goes: yes, I think there are mistakes, and Gil was aware of them. I don't write; I was under the impression Gil would iron those out after it passed some sort of review during the acceptance of the paper. But to me it was obvious the reviewers didn't see the weaknesses in the paper that needed fixing. Gil gave me the impression that it's some sort of game to get a paper in.
    I think the people who rejected it didn't have a clue what it does, and that some thought BWT is already bijective if you add an EOF symbol. I don't trust papers; I think they are mostly for the elite to pat themselves on the back. I had little to do with the actual writing. The concept was mine; the paper write-up was his. As for fake papers, just look at the nonsense of the phony global warming hype. And yes, I did work for NASA, which at one time I was proud of; now I am not so proud of that fact.
    However, Yuta was in my mind a very great help in that he made the linear-time and linear-space coding for the algorithm.

    It was Mark Nelson that led me to come up with this. I am more interested in encryption than compression, but since encryption is so controlled by the government and it's hard to prove results, I focused on compression. It seemed obvious to me that, since it's generally best to compress before one encrypts, BWT compression was already the best as far as unicity distance considerations go. I had a gut feeling the index was not needed for BWT and that dropping it would make compression for encryption better.
    It was not really hard to make bijective transforms, but it took some time to make one that seemed reasonable. For the 256-symbol case, which is what most of the posted code is for, I think you see the best algorithm. For the binary case I can do better than just reducing it to 2 symbols and then using the old method. That is, there is a method that is a better bijective transform than the pure BWTS.

    I plan to write at least one more compression program using a binary bijective modified BWTS transform. The goal will not be speed. I hope it compresses well, but the goal will be to make a compression program that allows the unicity distance to be greatly increased if one uses any simple block encryption method on the result and the enemy knows exactly what you have done and has your code and the ciphertext as his only means of attack. Of course this doesn't rule out chosen-plaintext attacks. But if your enemy can choose your messages, then you most likely have far greater problems.

    Another piece of code I might write is an internal variable-keyed all-or-nothing transform using these methods, where again no secret key is used since the key is part of the message. But the message will be a single block; see

    http://en.wikipedia.org/wiki/All-or-nothing_transform

    In answer to your question: BWTS will most likely lead to about the same compression as a BWT-based compressor, with only a few bytes saved, sometimes smaller, sometimes not, but it will be slower. Actually, for many files, when you look at the data stream there is little difference between the data from a BWT or a BWTS, and the fact is there are many ways to do a BWT, so even BWT data for the same file may be different, since there seems to be no real standard for how the data stream should look. Again, the differences are usually small.
