
Thread: SCOTT TRANSFORM

  #1 - biject.bwts

    SCOTT TRANSFORM

    I am calling it the Scott Transform since it's based on a BWTS, and few understand BWTS, which is necessary for understanding this transform. So it's not likely in current use.
    I will write a 4-stream entropy coder later.

    Basically it is a binary bijective transform that is LENGTH PRESERVING. Instead of getting your data in terms of long runs of zeroes and ones, you get unary numbers for the runs: 1111 or 0000 becomes 0001. The unary numbers are also arranged so that you could compress the whole series of numbers as one stream. Or you could compress every alternate number, so that you would be compressing the underlying runs of ones and of zeroes as separate number streams. Or lastly, you can look at the data as 4 separate streams based on what the First and Last columns sort to. You need only do a forward pass of the data for the entropy coder, and the streams break apart in a natural way. I will hopefully post a good 4-stream entropy coder as a separate program.
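
    A minimal sketch of just the run-to-unary step, assuming the input is already a BWTS bit string held one bit per byte (the function name is made up for illustration, and the real transform's length-preserving boundary handling is left out):
    Code:
    /* Each bit is XORed with its successor, so a run of k equal
       bits becomes k-1 zeros followed by a 1: 1111 -> 0001. */
    #include <stdio.h>

    void runs_to_unary(const unsigned char *bits, int n, unsigned char *out)
    {
        for (int i = 0; i + 1 < n; i++)
            out[i] = bits[i] ^ bits[i + 1];  /* 1 marks the end of a run */
        out[n - 1] = 1;                      /* the last run ends at EOF */
    }

    int main(void)
    {
        unsigned char in[] = {1,1,1,1,0,0,0,0}, out[8];
        runs_to_unary(in, 8, out);
        for (int i = 0; i < 8; i++) printf("%d", out[i]);
        printf("\n");                        /* prints 00010001 */
        return 0;
    }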

    Here is the debug output of running it on a file containing "BANANAS":




    C:\>scott_transD f bananas.x x.x
    Doing the forward scott transform
    Version 20110802
    01000010 01000001 01001110 01000001 01001110 01000001 01010011
    input file total = 56 group = 30 zeros = 36 ones = 20

    11110000 00011100 00111000 01111000 10001000 11000000 00001100
    normal BWTS total = 56 group = 16 zeros = 36 ones = 20

    11100000 00111010 00111000 00111100 00001000 11000000 00011100
    modified BWTS total = 56 group = 16 zeros = 36 ones = 20

    THE 4 stream representation
    IIIIOOOi iiOOOOOO OOOiiooo ooooIIIo IoooIIIo ooooIIII OOOioooo
    00010010 01000000 00101000 00010011 10010010 00010001 00110000
    total = 56 group = 25 zeros = 41 ones = 15



    Like I said, it's bijective and length preserving. Please take a look at it.
    Attached Files


  #2 - Matt Mahoney

    So this is a bitwise BWTS followed by a delta transform (or binary MTF)? Interesting.

    768,771 book1
    442,501 book1.fpaq0p
    768,771 book1.scott_trans
    248,970 book1.scott_trans.fpaq0p

    513,216 pic
    67,909 pic.fpaq0p
    513,216 pic.scott_trans
    58,928 pic.scott_trans.fpaq0p

  #3 - biject.bwts

    Well, yes, you could say that. However, I was talking to Shelwien and he wishes me to explain more just what it is, so I am going to try in this post.
    If you look at http://bijective.dogma.net/compresa.htm#iBBW you can see my code versus Nagy's for the binary one-zero thing (sadly it's from my DOS days, so none of the executables are likely to run on most machines; also, I now use a different C compiler, MinGW, so they may not compile straight). I really don't have a clue what he did, but I suspect he was just compressing a single string of numbers. On the page you can see I did a BWTS and an MTFQ and wrote a special entropy coder based on two streams. Well, I really don't like what I did. First of all, I converted to a string of zeroes and ones. This string is often not a multiple of 8 bits, since I have a way of mapping all 1-0 strings to bytes, and bytes are a multiple of 8 bits but a string can be any length. Second, the MTFQ maps to two sets of unary numbers depending on the first bit: it will use either 11110 or 00001 for runs, not both. Lastly, I am working in binary, so why use a BWTS that uses Lyndon words that could be all zeroes or all ones? I wish to use only Lyndon words that start with zero and end in one. Likewise, I want all runs to be of the form 00001 and no runs of the form 11110; that really bothered me. So now fast forward to this transform.


    I also want to have the choice of looking at either a single stream of numbers or two streams, where one is for the zeros and the other for the ones. And, what the hell, make it so I can use four streams of numbers; see below. Lastly, I don't want the length to change with the transform. So here it is.


    First I temporarily add a zero in front of the string and a one in back; this makes the string 8*n+2 bits long. Then I do the standard binary BWTS (in debug mode I show what that is). Note the BWTS will only have Lyndon words that start with zero and end in one. Also of prime importance is that in the 8*n+2 state there are exactly as many runs of ones as of zeros, which means the total number of runs is always even. This fact will be used later.
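
    That run-parity claim is easy to check with a hedged standalone sketch (bits held one per byte; any string padded with a 0 in front and a 1 in back should pass the assert):
    Code:
    #include <assert.h>
    #include <stdio.h>

    /* count runs of the given bit value */
    int count_runs(const unsigned char *b, int n, int bit)
    {
        int runs = 0;
        for (int i = 0; i < n; i++)
            if (b[i] == bit && (i == 0 || b[i - 1] != bit))
                runs++;
        return runs;
    }

    int main(void)
    {
        /* the padded 8*n+2 form: starts with 0, ends with 1 */
        unsigned char s[] = {0, 1,1,0,1,0,0,1,0, 1};
        int n = (int)sizeof s;
        int z = count_runs(s, n, 0), o = count_runs(s, n, 1);
        printf("zero runs %d, one runs %d\n", z, o);
        /* runs alternate, the first is a 0-run and the last a 1-run,
           so the counts match and the total is even */
        assert(z == o && (z + o) % 2 == 0);
        return 0;
    }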


    When you do the transform you magically have a one in front and a zero at the end. Since I added a zero and a one, I now take off a zero and a one, and I am back to the modified BWTS. Note that the total numbers of ones and zeros are the same.

    The next fact is that when you look at the First column you have only two runs: a run of zeros followed by a run of ones. Also notice that everywhere you have the condition of FL being 01 in the First and Last columns, there is a matching FL of 10 from the bottom.

    The next trick is to realize that I can use the one in front and the zero in back of the 8*n+2 bits and still keep the same size if I drop two other bits somewhere inside the data set.

    Since I know it starts with a 1, I encode the run of ones at the front, e.g. 1 to 1 or 111 to 001. When that is done I go to the back end of the file; noticing I have a run of zeros there, I encode it, e.g. 000 to 001, and this is added to the output. I then continue on the back side and encode the run of ones. I keep doing this until I have seen enough zeros to account for the ones that occurred in front. I then switch to reading from the front of the file again, until there are more ones on top than zeroes on bottom. I do this until I have exhausted 8*n bits. That leaves 2 bits somewhere in the center of the file that are dropped.

    When you do the reverse transform you rebuild the 8*n+2 thing. You can directly compute all the bits from the data; the trick is calculating the two bits that were dropped. One way is to look at how the unary file ended, as either 0001 or 000, and remember that in unary mode there has to be an even number of runs, so if the count is odd so far, 0001[01] gives the missing bits. Note that in the 8*n+2 mode the last bit ends a unary number, so it's always a one. Then the bit next to it is either a one or a zero, whichever makes the total number of runs in the 8*n+2 thing an even number. Hope that helps.

    Running the D version shows the four separate streams in the output: I is a starting run of ones from the front, O is a starting run of zeros from the back, i is a run of ones next to an O from the back, and o is a run of zeroes from the front next to an I.

    I chose the order to make it easy to use any of one, two, or four streams. One could really make this transform much more complicated, but why do that unless you're using a highly modified version for some sort of encryption.

    One last point: you really have several transforms here. I started at the front of the file using BWTS; you could start at the back. Also, since it's a string, you could scan the bits in the opposite direction. You could also change the ones to zeroes and the zeroes to ones. I have not looked at that, but there are dozens of modes. You could customize it based on the data structure of the files in question if they have a common form.

    OK, I hope this was good enough. I reread it several times; sadly, I see what I want to see and not what is there.

  #4 - biject.bwts

    Quote Originally Posted by Matt Mahoney:
    So this is a bitwise BWTS followed by a delta transform (or binary MTF)? Interesting.

    768,771 book1
    442,501 book1.fpaq0p
    768,771 book1.scott_trans
    248,970 book1.scott_trans.fpaq0p

    513,216 pic
    67,909 pic.fpaq0p
    513,216 pic.scott_trans
    58,928 pic.scott_trans.fpaq0p

    Matt, I am glad you looked at it, and I am glad it makes book1 compress better than fpaq0p by itself. I don't know how many symbol types are in book1, but I am guessing fewer than 128. I would be surprised if the Scott transform, which is actually meant for bitstream compression, created fewer than 250 symbols, or maybe even 255. The fact that it compresses at all after this transform means that the symbols that occur the most after it are rich in zeroes: 00000000 may occur the most, along with those with only one or two bits set to one. Yet I don't think it would help in higher-order byte compression, since it's unlikely that bytes are related to each other in a simple way after it. I think your PAQ compressor would compress worse after this transform, since it's byte-oriented and this is only going to confuse it.

  #5 - Matt Mahoney

    Probably. An adaptive order 0 model like fpaq0p is better suited for BWT, and I suppose for binary BWTS. Anyway, I included pic because it is bit-oriented data (MSB first) rather than byte-oriented.

  #6 - willvarfar

    No LTCB results yet, then?

  #7 - biject.bwts

    Quote Originally Posted by willvarfar:
    No LTCB results yet, then?
    If you mean the Large Text Compression Benchmark, you have to realize the transform as written only uses blocks of 1,000,000 bytes, so you would be working on 1,000 separate blocks. I did write a 64-bit exe that does enwik9 for the 256-symbol case as one block; it worked on my PC, which has 6GB of memory. So it would not be hard to write a 64-bit exe to handle enwik8 in one block, and you could have a version that handles enwik9 with 10 blocks. I am playing around with a 4-stream entropy coder at this point and only looking at files of 1,000,000 bytes or less.
    Besides, if you used one of the better programs from the LTCB, they would compress worse, since they are more byte-oriented than bit-oriented. Also, their code would not be aware of the 4 separate streams, so that info would not be used, causing worse compression. At least that is my gut instinct. But I would be happier if I was wrong.

  #8 - Matt Mahoney

    It's a transform, not a compressor, so no LTCB benchmark yet. It still needs a back end model and coder.

  #9 - biject.bwts

    Yes, it's a transform, not a compressor; it changes the length by ZERO.

    However, using the worst compressor from the LTCB, namely arb2x, which uses a single stream and has one cell that counts only zero bits and one bits, you get the results below. I added this to Matt's test and also used my bijective 256-symbol coder arb255.


    768,771 book1
    763,152 book1.arb2x
    435,125 book1.arb255
    442,501 book1.fpaq0p
    768,771 book1.scott_trans
    248,970 book1.scott_trans.fpaq0p
    367,326 book1.scott_trans.arb2x
    279,327 book1.scott_trans.arb255

    Not sure if I have the MinGW versions of arb.. out there, but I will update in a few days and include the 2-stream and 4-stream versions of arb2x for a quick test of the 4 streams.

    I am really surprised that fpaq0p beat arb255 in this case. arb255 is a true optimal stationary arithmetic compressor. I am not surprised it beat arb255 on the file book1, but the fact that it beat it on book1.scott_trans by a larger margin is a mystery to me.

    By that I mean: take any byte permutation of the file, and arb255 will compress it to the same size (the length could span a byte boundary, so you could get n and n+1 for some files). I would be shocked if there was no permutation of book1.scott_trans where fpaq0p does worse than arb255. I think fpaq0p is a tuned nonstationary arithmetic compressor. Why this transform, which in my mind is a spread-spectrum sort of thing, would allow this to occur is very interesting, since it's not the sort of file that the code was tuned for. Oh well, it makes life interesting.
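
    In sketch form (my hedged reading, ignoring the byte-boundary rounding mentioned above): a stationary order-0 coder's output length depends only on the byte counts, so every permutation of the file codes to the same ideal length.
    Code:
    #include <math.h>

    /* ideal bits = sum over bytes i of n_i * log2(N / n_i);
       any permutation of the file has the same counts, so the
       same ideal length (compile with -lm) */
    double order0_bits(const long cnt[256], long total)
    {
        double bits = 0.0;
        for (int i = 0; i < 256; i++)
            if (cnt[i])
                bits += cnt[i] * log2((double)total / cnt[i]);
        return bits;
    }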

  #10 - caveman

    Quote Originally Posted by biject.bwts:
    I am glad it makes book1 compress better than fpaq0p by itself. I don't know how many symbol types are in book1, but I am guessing fewer than 128. I would be surprised if the Scott transform, which is actually meant for bitstream compression, created fewer than 250 symbols, or maybe even 255. The fact that it compresses at all after this transform means that the symbols that occur the most after it are rich in zeroes: 00000000 may occur the most, along with those with only one or two bits set to one.
    book1:
    82 different bytes
    entropy: 4.52
    55.03% of zeroes / 44.97% of ones

    book1.scott:
    256 different bytes (550244 times 0x00 out of 768771, that's 71.57%)
    entropy: 2.90
    89.72% of zeroes / 10.28% of ones
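
    These numbers can be reproduced with a small sketch like this (entropy is order-0 bits per byte; compile with -lm):
    Code:
    #include <math.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        long cnt[256] = {0}, total = 0, ones = 0;
        int c;
        FILE *f;
        if (argc < 2 || !(f = fopen(argv[1], "rb"))) return 1;
        while ((c = getc(f)) != EOF) {
            cnt[c]++; total++;
            for (int b = 0; b < 8; b++) ones += (c >> b) & 1;
        }
        fclose(f);
        int distinct = 0; double h = 0.0;
        for (int i = 0; i < 256; i++)
            if (cnt[i]) {
                distinct++;
                double p = (double)cnt[i] / total;
                h -= p * log2(p);    /* order-0 entropy contribution */
            }
        printf("%d different bytes, entropy %.2f\n", distinct, h);
        printf("%.2f%% of ones\n", 100.0 * ones / (8.0 * total));
        return 0;
    }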

  #11 - Matt Mahoney

    For BWT you need a fast adapting model, not a stationary one. Here are a few more experiments.

    Code:
    768,771 book1
    768,771 book1.scott_trans
    
    231,899 book1-m1.zpaq
    250,286 book1.scott_trans-1.zpaq
    252,387 book1.scott_trans.fpaq0f2
    
    215,277 book1-m2.zpaq
    237,694 book1.scott_trans-2.zpaq
    
    277,515 book1.scott_trans.fpaq0
    262,988 book1.scott_trans-cm1-255.zpaq
    
    248,970 book1.scott_trans.fpaq0p
    249,951 book1.scott_trans-cm1-8.zpaq
    book1-m1.zpaq is compressed with zpaq -m1. It uses BWT+RLE+ICM0. scott_trans-1 takes out the BWT pre/post but uses the same model. The config file is:
    Code:
    comp 1 0 0 0 1
      0 icm 5
    hcomp
      halt
    post 0 end
    fpaq0f2 also uses an indirect model like this but with a different bit history.

    zpaq -m2 uses BWT+ICM0-ISSE1 (no RLE). Again I replaced the BWT with scott_trans.
    Code:
    comp 1 0 0 0 2
      0 icm 5
      1 isse 12 0
    hcomp
      d= 1 *d=0 hashd halt
    post 0 end
    It uses an order 0 ICM like before, but then adjusts the prediction using an order 1 ISSE. An ISSE is a 2-input mixer taking the ICM prediction and a fixed constant as inputs, with the weights selected by an order 1 bit history.
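
    A rough floating-point sketch of the 2-input mixer idea (the real ISSE uses fixed-point arithmetic and selects a weight pair per bit history; this sketch keeps a single hypothetical pair just to show the update):
    Code:
    #include <math.h>

    typedef struct { double w0, w1; } Mixer;

    static double squash(double x)  { return 1.0 / (1.0 + exp(-x)); }
    static double stretch(double p) { return log(p / (1.0 - p)); }

    /* mix the ICM prediction with a constant input of 1 */
    double mix(const Mixer *m, double p_icm)
    {
        return squash(m->w0 * stretch(p_icm) + m->w1);
    }

    /* gradient step moving the mixed prediction toward bit y */
    void mix_update(Mixer *m, double p_icm, double p_mixed, int y, double rate)
    {
        double err = y - p_mixed;
        m->w0 += rate * err * stretch(p_icm);
        m->w1 += rate * err;          /* weight of the constant input */
    }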

    fpaq0 is a stationary order 0 model. -cm1-255.zpaq is similar:
    Code:
    comp 1 0 0 0 1
      0 cm 9 255
    hcomp
      halt
    post 0 end
    zpaq does not have a pure stationary model but this is pretty close. A CM maps a context directly to a prediction. The context is order 0, thus HCOMP is empty. But there is an implied bitwise context expanded to 9 bits to reduce cache misses. The arguments to the CM are the context size in bits and 1/4 the maximum count. The learning rate is 1/count where count is limited to 255*4, which is the maximum supported.

    fpaq0p is an adaptive direct context model with a learning rate of 1/32. It is equivalent to "CM 9 8" (replacing component 0 above, as in cm1-8.zpaq), except that the learning rate is faster until the count reaches 32.
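
    Both update rules can be sketched with a 12-bit probability p that the next bit is 1 (a simplification; details of the real coders differ):
    Code:
    typedef struct { int p; int n; } Model;  /* p in [0,4096], n = count */

    /* fpaq0p-style: fixed learning rate 1/32 */
    void update_fixed(Model *m, int y)
    {
        m->p += ((y << 12) - m->p) / 32;
    }

    /* CM-style: rate 1/count, count capped at 255*4 so the model
       stays slightly adaptive instead of fully stationary */
    void update_counted(Model *m, int y)
    {
        if (m->n < 255 * 4) m->n++;
        m->p += ((y << 12) - m->p) / m->n;
    }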

  #12 - biject.bwts

    Matt, when you say you replaced BWT with scott_trans: are you replacing a BWT that sorts bytes with one that sorts bits? I don't really expect too much gain if the code was designed for bytes. Here are results using lzw1_m32. It's a simple bijective 2-root LZW where the compression changes very slowly, since each pass replaces a string of bits with 2 strings that are one bit longer. This is the MinGW version of LZW1, and it's designed to work with a string of bits. I think one needs a special string compressor that favors zeros over ones out of the gate to get really small values, and this compressor does not do that, but it's interesting.

    768,771 book1
    768,771 book1.scott_trans
    491,700 book1.lzw1_m32
    244,769 book1.scott_trans.lzw1_m32

    file bananas 7 characters "BANANAS"
    7 bananas.scott_trans
    9 bananas.lzw1_32
    6 bananas.scott_trans.lzw1_m32

  #13 - biject.bwts

    Quote Originally Posted by biject.bwts:
    Matt, when you say you replaced BWT with scott_trans: are you replacing a BWT that sorts bytes with one that sorts bits? I don't really expect too much gain if the code was designed for bytes. Here are results using lzw1_m32. It's a simple bijective 2-root LZW where the compression changes very slowly, since each pass replaces a string of bits with 2 strings that are one bit longer. This is the MinGW version of LZW1, and it's designed to work with a string of bits. I think one needs a special string compressor that favors zeros over ones out of the gate to get really small values, and this compressor does not do that, but it's interesting.

    768,771 book1
    768,771 book1.scott_trans
    491,700 book1.lzw1_m32
    244,769 book1.scott_trans.lzw1_m32

    file bananas 7 characters "BANANAS"
    7 bananas.scott_trans
    9 bananas.lzw1_32
    6 bananas.scott_trans.lzw1_m32
    MY MISTAKE, I SHOULD CUT AND PASTE:
    8 bananas.lzw1_m32 (I meant to add one to the length, and forgot the m before 32). I think I will have another fine German beer and call it a night; my eyes are tired.

  #14 - caveman

    Quote Originally Posted by biject.bwts:
    lzw1_m32. It's a simple bijective 2-root LZW where the compression changes very slowly, since each pass replaces a string of bits with 2 strings that are one bit longer.
    Shouldn't that be shorter rather than longer?

  #15 - Matt Mahoney

    I'm using a byte-oriented BWT, so I expect that to be better than bit-oriented.

  #16 - biject.bwts

    Quote Originally Posted by caveman:
    Shouldn't that be shorter rather than longer?
    Well, it depends on how you think of it. Here is another way to think about it. At the start you have two strings, "0" and "1". Suppose the file being compressed is 01...; then you output a token, either a one or a zero. You then create two new strings, so you have 3 strings: 01, 00, and 1. You toss the 0 string in the trash bin; you never use it again. Example: suppose you have 8 strings and you're looking at 000 001 010 011 100 101 110 111. You have 8 strings, and to output a token that says which string was used you need exactly 3 bits, so when this is the table you get no compression. But suppose the input file is a string of zeroes; then you eventually get this working set of strings:
    1 01 001 0001 00001 000001 0000001 0000000. In this case you still need to output 3 bits for any string. If the input is all zeroes, you output 3 bits for the current string of 7 zeros, then you trash that string and create 2 new longer strings, one with 8 zeroes and the next with 7 zeroes followed by a 1. In fact you actually save output space: when you have this set, most of the time you expand by 2 bits if the input was a 1, expand by 1 bit if it was 01, break even if the input is 001, and on everything else you are shortening the output file.

    It's that simple; hope this helps.

    The code for LZW1 was very simple. It can be made to compress this kind of zero-rich file better. The reason is I used two-level Huffman, which is not as good as arithmetic. Example: suppose there are 6 strings in the tree. With the two-level Huffman, 2 tokens get written in 2 bits and 4 get written with 3 bits. If every string is equally likely, then the average length out is (2*2 + 4*3)/6 = 2.6666667 bits. If you used arithmetic, the average would be ln(6)/ln(2) = 2.5849625 bits, so clearly in the general case arithmetic is better. However, if one is tuning LZW1 for a zero-rich environment, you should have the strings arranged and sorted such that those ending in 0 use the 2 bits and those ending in 1 use the 3 bits. This would greatly help compression when preceded by scott_trans. If I get the time I will mod LZW1 for a version tuned for just the zero-rich case.
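
    For reference, the "2 tokens in 2 bits, 4 tokens in 3 bits" assignment is a phased-in binary code; here is a sketch of the length assignment I believe the two-level Huffman amounts to (the split point for general n is an assumption on my part):
    Code:
    #include <stdio.h>

    /* with n >= 2 strings and k = floor(log2(n)), the first
       2^(k+1) - n indices get k bits, the rest k+1 bits */
    void phase_in_lengths(int n)
    {
        int k = 0;
        while ((2 << k) <= n) k++;
        int shortcodes = (2 << k) - n;
        for (int i = 0; i < n; i++)
            printf("index %d -> %d bits\n", i, i < shortcodes ? k : k + 1);
    }

    int main(void) { phase_in_lengths(6); return 0; }  /* 2x2-bit, 4x3-bit */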

  #17 - biject.bwts

    Quote Originally Posted by Matt Mahoney:
    I'm using a byte-oriented BWT, so I expect that to be better than bit-oriented.
    Matt, your code is tuned for a byte-oriented BWT, and if you substituted BWTS directly and then dropped the coding of the INDEX, it would likely compress much better than using the bit-oriented scott_trans. I think your original BWT would still beat byte BWTS on average, but not by much, since they are almost the same. But your stuff was tuned using the BWT that you use.
    As for scott_trans, it kind of says you can compress only if the data is rich in zeros, so the after-stages would have to assume a zero-rich input. Just as I am sure I can make LZW1 better if I change it to expect more zeroes: it would compress highly redundant files better. Of course, since any output is possible, it being a bijective transform, you could have a one-rich output file. But I think it would come from a random-looking file before the transform; most random input files would produce random output files.

  #18 - caveman

    Quote Originally Posted by biject.bwts:
    Well, it depends on how you think of it. Here is another way to think about it. At the start you have two strings, "0" and "1". Suppose the file being compressed is 01...; then you output a token, either a one or a zero. You then create two new strings, so you have 3 strings: 01, 00, and 1. You toss the 0 string in the trash bin; you never use it again. Example: suppose you have 8 strings and you're looking at 000 001 010 011 100 101 110 111. You have 8 strings, and to output a token that says which string was used you need exactly 3 bits, so when this is the table you get no compression. But suppose the input file is a string of zeroes; then you eventually get this working set of strings:
    1 01 001 0001 00001 000001 0000001 0000000. In this case you still need to output 3 bits for any string. If the input is all zeroes, you output 3 bits for the current string of 7 zeros, then you trash that string and create 2 new longer strings, one with 8 zeroes and the next with 7 zeroes followed by a 1. In fact you actually save output space: when you have this set, most of the time you expand by 2 bits if the input was a 1, expand by 1 bit if it was 01, break even if the input is 001, and on everything else you are shortening the output file.

    Its that simple hope this helps
    I get it, it's a bit like dynamic Tunstall codes (variable-length to fixed-length codes). For instance, if you have 65.3683% zeros and 34.6317% ones, 8-bit Tunstall codes look like these: codes [0-55] expand the output, codes [56-146] break even, and codes [147-255] save between 1 and 4 bits. The Tunstall codes themselves could be output with an entropy coder, since their probabilities of occurrence are skewed (this would result in a variable-length to variable-length coder). A construction sketch follows the table.
    Code:
     Tun  Size Probability Coded bits
       0    5   0.0049816  11111
       1    6   0.0032564  111101
       2    6   0.0032564  111011
       3    6   0.0032564  110111
       4    6   0.0032564  101111
       5    6   0.0032564  011111
       6    7   0.0021286  1111001
       7    7   0.0040179  1111000
       8    7   0.0021286  1110101
       9    7   0.0040179  1110100
      10    7   0.0021286  1110011
      11    7   0.0040179  1110010
      12    7   0.0040179  1110001
      13    7   0.0021286  1101101
      14    7   0.0040179  1101100
      15    7   0.0021286  1101011
      16    7   0.0040179  1101010
      17    7   0.0040179  1101001
      18    7   0.0021286  1100111
      19    7   0.0040179  1100110
      20    7   0.0040179  1100101
      21    7   0.0040179  1100011
      22    7   0.0021286  1011101
      23    7   0.0040179  1011100
      24    7   0.0021286  1011011
      25    7   0.0040179  1011010
      26    7   0.0040179  1011001
      27    7   0.0021286  1010111
      28    7   0.0040179  1010110
      29    7   0.0040179  1010101
      30    7   0.0040179  1010011
      31    7   0.0021286  1001111
      32    7   0.0040179  1001110
      33    7   0.0040179  1001101
      34    7   0.0040179  1001011
      35    7   0.0040179  1000111
      36    7   0.0021286  0111101
      37    7   0.0040179  0111100
      38    7   0.0021286  0111011
      39    7   0.0040179  0111010
      40    7   0.0040179  0111001
      41    7   0.0021286  0110111
      42    7   0.0040179  0110110
      43    7   0.0040179  0110101
      44    7   0.0040179  0110011
      45    7   0.0021286  0101111
      46    7   0.0040179  0101110
      47    7   0.0040179  0101101
      48    7   0.0040179  0101011
      49    7   0.0040179  0100111
      50    7   0.0021286  0011111
      51    7   0.0040179  0011110
      52    7   0.0040179  0011101
      53    7   0.0040179  0011011
      54    7   0.0040179  0010111
      55    7   0.0040179  0001111
      56    8   0.0026264  11100001
      57    8   0.0049574  11100000
      58    8   0.0026264  11010001
      59    8   0.0049574  11010000
      60    8   0.0026264  11001001
      61    8   0.0049574  11001000
      62    8   0.0026264  11000101
      63    8   0.0049574  11000100
      64    8   0.0026264  11000011
      65    8   0.0049574  11000010
      66    8   0.0049574  11000001
      67    8   0.0026264  10110001
      68    8   0.0049574  10110000
      69    8   0.0026264  10101001
      70    8   0.0049574  10101000
      71    8   0.0026264  10100101
      72    8   0.0049574  10100100
      73    8   0.0026264  10100011
      74    8   0.0049574  10100010
      75    8   0.0049574  10100001
      76    8   0.0026264  10011001
      77    8   0.0049574  10011000
      78    8   0.0026264  10010101
      79    8   0.0049574  10010100
      80    8   0.0026264  10010011
      81    8   0.0049574  10010010
      82    8   0.0049574  10010001
      83    8   0.0026264  10001101
      84    8   0.0049574  10001100
      85    8   0.0026264  10001011
      86    8   0.0049574  10001010
      87    8   0.0049574  10001001
      88    8   0.0026264  10000111
      89    8   0.0049574  10000110
      90    8   0.0049574  10000101
      91    8   0.0049574  10000011
      92    8   0.0026264  01110001
      93    8   0.0049574  01110000
      94    8   0.0026264  01101001
      95    8   0.0049574  01101000
      96    8   0.0026264  01100101
      97    8   0.0049574  01100100
      98    8   0.0026264  01100011
      99    8   0.0049574  01100010
     100    8   0.0049574  01100001
     101    8   0.0026264  01011001
     102    8   0.0049574  01011000
     103    8   0.0026264  01010101
     104    8   0.0049574  01010100
     105    8   0.0026264  01010011
     106    8   0.0049574  01010010
     107    8   0.0049574  01010001
     108    8   0.0026264  01001101
     109    8   0.0049574  01001100
     110    8   0.0026264  01001011
     111    8   0.0049574  01001010
     112    8   0.0049574  01001001
     113    8   0.0026264  01000111
     114    8   0.0049574  01000110
     115    8   0.0049574  01000101
     116    8   0.0049574  01000011
     117    8   0.0026264  00111001
     118    8   0.0049574  00111000
     119    8   0.0026264  00110101
     120    8   0.0049574  00110100
     121    8   0.0026264  00110011
     122    8   0.0049574  00110010
     123    8   0.0049574  00110001
     124    8   0.0026264  00101101
     125    8   0.0049574  00101100
     126    8   0.0026264  00101011
     127    8   0.0049574  00101010
     128    8   0.0049574  00101001
     129    8   0.0026264  00100111
     130    8   0.0049574  00100110
     131    8   0.0049574  00100101
     132    8   0.0049574  00100011
     133    8   0.0026264  00011101
     134    8   0.0049574  00011100
     135    8   0.0026264  00011011
     136    8   0.0049574  00011010
     137    8   0.0049574  00011001
     138    8   0.0026264  00010111
     139    8   0.0049574  00010110
     140    8   0.0049574  00010101
     141    8   0.0049574  00010011
     142    8   0.0026264  00001111
     143    8   0.0049574  00001110
     144    8   0.0049574  00001101
     145    8   0.0049574  00001011
     146    8   0.0049574  00000111
     147    9   0.0032406  110000001
     148    9   0.0032406  101000001
     149    9   0.0061167  101000000
     150    9   0.0032406  100100001
     151    9   0.0061167  100100000
     152    9   0.0032406  100010001
     153    9   0.0061167  100010000
     154    9   0.0032406  100001001
     155    9   0.0061167  100001000
     156    9   0.0032406  100000101
     157    9   0.0061167  100000100
     158    9   0.0032406  100000011
     159    9   0.0032406  011000001
     160    9   0.0061167  011000000
     161    9   0.0032406  010100001
     162    9   0.0061167  010100000
     163    9   0.0032406  010010001
     164    9   0.0061167  010010000
     165    9   0.0032406  010001001
     166    9   0.0061167  010001000
     167    9   0.0032406  010000101
     168    9   0.0061167  010000100
     169    9   0.0032406  010000011
     170    9   0.0032406  001100001
     171    9   0.0061167  001100000
     172    9   0.0032406  001010001
     173    9   0.0061167  001010000
     174    9   0.0032406  001001001
     175    9   0.0061167  001001000
     176    9   0.0032406  001000101
     177    9   0.0061167  001000100
     178    9   0.0032406  001000011
     179    9   0.0032406  000110001
     180    9   0.0061167  000110000
     181    9   0.0032406  000101001
     182    9   0.0061167  000101000
     183    9   0.0032406  000100101
     184    9   0.0061167  000100100
     185    9   0.0032406  000100011
     186    9   0.0032406  000011001
     187    9   0.0061167  000011000
     188    9   0.0032406  000010101
     189    9   0.0061167  000010100
     190    9   0.0032406  000010011
     191    9   0.0061167  000010010
     192    9   0.0061167  000010001
     193    9   0.0032406  000001101
     194    9   0.0061167  000001100
     195    9   0.0032406  000001011
     196    9   0.0032406  000000111
     197    9   0.0061167  000000011
     198   10   0.0021183  1100000001
     199   10   0.0039984  1100000000
     200   10   0.0021183  1000000101
     201   10   0.0039984  1000000100
     202   10   0.0021183  1000000011
     203   10   0.0039984  1000000010
     204   10   0.0039984  1000000001
     205   10   0.0021183  0100000101
     206   10   0.0039984  0100000100
     207   10   0.0021183  0100000011
     208   10   0.0039984  0100000010
     209   10   0.0039984  0100000001
     210   10   0.0021183  0010000101
     211   10   0.0039984  0010000100
     212   10   0.0021183  0010000011
     213   10   0.0039984  0010000010
     214   10   0.0039984  0010000001
     215   10   0.0021183  0001000101
     216   10   0.0039984  0001000100
     217   10   0.0021183  0001000011
     218   10   0.0039984  0001000010
     219   10   0.0039984  0001000001
     220   10   0.0039984  0000100001
     221   10   0.0021183  0000010101
     222   10   0.0039984  0000010100
     223   10   0.0021183  0000010011
     224   10   0.0039984  0000010010
     225   10   0.0039984  0000010001
     226   10   0.0021183  0000001101
     227   10   0.0039984  0000001100
     228   10   0.0021183  0000001011
     229   10   0.0039984  0000001010
     230   10   0.0039984  0000001001
     231   10   0.0039984  0000000101
     232   10   0.0039984  0000000011
     233   11   0.0026137  10000000001
     234   11   0.0049334  10000000000
     235   11   0.0026137  01000000001
     236   11   0.0049334  01000000000
     237   11   0.0026137  00100000001
     238   11   0.0049334  00100000000
     239   11   0.0026137  00010000001
     240   11   0.0049334  00010000000
     241   11   0.0026137  00001000001
     242   11   0.0049334  00001000000
     243   11   0.0026137  00000100001
     244   11   0.0049334  00000100000
     245   11   0.0026137  00000010001
     246   11   0.0049334  00000010000
     247   11   0.0026137  00000001001
     248   11   0.0049334  00000001000
     249   11   0.0026137  00000000101
     250   11   0.0049334  00000000100
     251   11   0.0026137  00000000011
     252   11   0.0049334  00000000010
     253   11   0.0049334  00000000001
     254   12   0.0032249  000000000001
     255   12   0.0060871  000000000000
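
    A construction sketch for a table like the one above: start with the strings "0" and "1" and repeatedly split the most probable string into its two one-bit extensions until 256 remain (a linear scan stands in for a heap, and the output is not sorted the way the table above is):
    Code:
    #include <stdio.h>
    #include <string.h>

    #define K 256

    int main(void)
    {
        double p0 = 0.653683, prob[K];
        char str[K][24];
        int n = 2;
        strcpy(str[0], "0"); prob[0] = p0;
        strcpy(str[1], "1"); prob[1] = 1.0 - p0;
        while (n < K) {
            int best = 0;
            for (int i = 1; i < n; i++)
                if (prob[i] > prob[best]) best = i;
            /* split: best becomes best+"0", best+"1" is appended */
            strcpy(str[n], str[best]); strcat(str[n], "1");
            prob[n] = prob[best] * (1.0 - p0);
            strcat(str[best], "0");
            prob[best] *= p0;
            n++;
        }
        for (int i = 0; i < n; i++)
            printf("%3d %2d %.7f %s\n", i, (int)strlen(str[i]), prob[i], str[i]);
        return 0;
    }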

  #19 - biject.bwts

    Quote Originally Posted by caveman:
    I get it, it's a bit like dynamic Tunstall codes (variable-length to fixed-length codes). For instance, if you have 65.3683% zeros and 34.6317% ones, 8-bit Tunstall codes look like these: codes [0-55] expand the output, codes [56-146] break even, and codes [147-255] save between 1 and 4 bits. The Tunstall codes themselves could be output with an entropy coder, since their probabilities of occurrence are skewed (this would result in a variable-length to variable-length coder).
    Code:
     Tun  Size Probability Coded bits
       0    5   0.0049816  11111
       1    6   0.0032564  111101
       2    6   0.0032564  111011
       3    6   0.0032564  110111
       4    6   0.0032564  101111
       5    6   0.0032564  011111
      ......
     252   11   0.0049334  00000000010
     253   11   0.0049334  00000000001
     254   12   0.0032249  000000000001
     255   12   0.0060871  000000000000
    Well, I see what you mean, but no, it's not the same. For example, with your table I guess you are assuming input in single bytes, so you have 8 bits in, and for output you have this fixed table where you output 5 to 12 bits. Unless you look at the ratio of bits seen so far and then make a new table; but that would be a lot of work.

    Here is another example for the 2-root LZW1.
    You look at the file as one big string. The first time, you have 2 input strings and two output strings, each one bit long, so in effect you are writing 1 of two possible strings. Then you read whatever amount of input is needed to reach the end of the tree, and output 1 of three. Since we are using bits, you output a single bit for 1 of the 3, and for each of the others you output 2 bits. You create a new table and continue, next outputting 1 of four, and so on.

    You could think of the outputs as (1 of 2), then (1 of 3), then (1 of 4), and so on till the file ends. So looking at the number of bits out in the first 3 tokens, you have either 4 bits out or 5 bits out.
    However, on input you scan first one bit and output a token; then scan either one or two bits and output a token; then scan 1 up to 3 bits and output a third token.

    Example: if 000000 is the input, here are the input/output trees.
    input output
    0 0
    1 1

    next tree
    input output
    00 0
    01 01
    1 11

    next tree
    input output
    000 00
    001 01
    01 10
    1 11

    So 0 00 000 goes to 0 0 00; you save 2 bits.

    Next example: 0110.
    first tree
    input output
    0 0
    1 1
    next tree
    input output
    00 0
    01 01
    1 11
    next tree
    00 00
    01 01
    10 10
    11 11

    So 0 1 10 becomes 0 11 10; with this sequence the output gets one bit longer.

    The end of input and end of output are handled by my bijective bit I/O, since I treat the file as a string, which by the way is not usually 8*n bits. Look at the code to see this: bit_byte.cpp and bit_byte.h, which are used in most of my code for I/O.
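
    A sketch of just the dictionary update described above, with the dictionary as the leaves of a 2-root binary trie (token coding and the bijective end handling are left out; names are made up for illustration):
    Code:
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Node { struct Node *kid[2]; int leaf; } Node;

    static Node *new_leaf(void)
    {
        Node *t = calloc(1, sizeof *t);
        t->leaf = 1;
        return t;
    }

    /* walk from a root to a leaf, then split that leaf into its
       two one-bit extensions; the leaf count grows by one */
    static void match_and_split(Node *root[2], const unsigned char *in, int *pos)
    {
        Node *t = root[in[(*pos)++]];
        while (!t->leaf)
            t = t->kid[in[(*pos)++]];
        t->leaf = 0;
        t->kid[0] = new_leaf();
        t->kid[1] = new_leaf();
    }

    int main(void)
    {
        Node *root[2] = { new_leaf(), new_leaf() };
        unsigned char in[] = {0,0,0,0,0,0};   /* the 000000 example */
        int pos = 0, tokens = 0;
        while (pos < 6) { match_and_split(root, in, &pos); tokens++; }
        printf("%d bits parsed as %d tokens\n", pos, tokens);
        return 0;                  /* prints: 6 bits parsed as 3 tokens */
    }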

  #20 - biject.bwts

    I recompiled scott_trans.c, changed m to 10,000,000 (10 million bytes, i.e. 80 million bits), and made it a 64-bit executable:


    /* Get the file size. */
    m = 10000000;
    // m = 1000;
    tT = (unsigned char *)malloc((size_t)(m * sizeof(unsigned char)));
    // *** top of big loop


    When I was cleaning up comments I somehow dropped the other write and never tested it for multiple blocks. Below is the correct version:

    } else {
        if (n == 0) {
            fwrite(V, sizeof(unsigned char), (size_t)m, stdout);
            fprintf(stderr, " \n Done \n");
            exit(0);
        } else {
            fwrite(V, sizeof(unsigned char), (size_t)m, stdout);
        }
    }
    m = n;


    I used lzw1_m64.exe, which is just lzw1 with more space: simple, unchanged, unoptimized.

    100,000,000 enwik8.scott_trans
    26,784,142 enwik8.scott_trans.lzw1_64m
    1,000,000,000 enwik9.scott_trans
    232,638,564 enwik9.scott_trans.lzw1_64m


    I will change lzw1 to favor zeros to get better results.


    I know many don't like LZW, especially a 2-root one that only replaces a single string each time with 2 strings one bit longer, but even though it's coded slowly I think it does well. I may also see what happens when I code it with bijective arithmetic instead of 2-level Huffman.

    later

  #21 - RichSelian

    I have read your implementation at http://bijective.dogma.net/bwts.zip. I'm surprised that your implementation runs so fast even in the worst case. I also have my own implementation of BWTS, using an O(n log n) suffix array construction algorithm, but it runs too slowly. Could you describe your sorting algorithm for me? (I mean the bounded_compare() function and the xx[] array; they are too hard for me to understand.)

    Sorry for my poor English.

  #22 - biject.bwts

    Quote Originally Posted by RichSelian:
    I have read your implementation at http://bijective.dogma.net/bwts.zip. I'm surprised that your implementation runs so fast even in the worst case. I also have my own implementation of BWTS, using an O(n log n) suffix array construction algorithm, but it runs too slowly. Could you describe your sorting algorithm for me? (I mean the bounded_compare() function and the xx[] array; they are too hard for me to understand.)

    Sorry for my poor English.

    That's OK, my English is poor too. However, the code you're referring to is old; I use xx and x for lots of things. I think you will find Yuta's code, in the many versions of OPENBWT on this site, better. But I did use XX[j] to point to the next location in the string with a character different from the jth character. When the value of j increases, you are going left to right down the string; when it goes negative, you are going back to the start of the Lyndon word. The code, depending on what version I left on the net, may not be the best. When you go back, you know that the string is less than any string to the left of that point, so you can stop the compare. If you are on two different Lyndon words and matches keep occurring, sooner or later the ending condition will occur. If you are on the same Lyndon word, I wrap a few times and then declare the one starting on the left the winner. I am not good at explaining, but I hope that helps.
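
    Roughly, the XX[] array can be built like this small sketch (my posted code does more than this; the helper name is made up):
    Code:
    /* xx[j] = smallest index > j holding a character different
       from s[j]; n means "past the end".  The compare loop can
       then hop over whole runs of equal characters. */
    void build_xx(const unsigned char *s, int n, int *xx)
    {
        xx[n - 1] = n;
        for (int j = n - 2; j >= 0; j--)
            xx[j] = (s[j + 1] != s[j]) ? j + 1 : xx[j + 1];
    }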

  #23 - RichSelian

    Quote Originally Posted by biject.bwts:
    That's OK, my English is poor too. However, the code you're referring to is old; I use xx and x for lots of things. I think you will find Yuta's code, in the many versions of OPENBWT on this site, better. But I did use XX[j] to point to the next location in the string with a character different from the jth character. When the value of j increases, you are going left to right down the string; when it goes negative, you are going back to the start of the Lyndon word. The code, depending on what version I left on the net, may not be the best. When you go back, you know that the string is less than any string to the left of that point, so you can stop the compare. If you are on two different Lyndon words and matches keep occurring, sooner or later the ending condition will occur. If you are on the same Lyndon word, I wrap a few times and then declare the one starting on the left the winner. I am not good at explaining, but I hope that helps.
    I think I understand what you say. So your implementation performs badly for the case "abababab...", with O(n^2 log n) time complexity, because the xx[] array cannot improve the speed.

  #24 - biject.bwts

    I may have written and come up with BWTS, but it was Yuta who came up with the first linear-time and linear-space routine for doing BWTS. Also, it depends on which version I posted. You can view abababababab...ab as repeats of the same Lyndon word; if the code does, then this would sort very fast. However, in some of my codings I would grab this as one big word, so it would depend on which version I used. However, abababab..abc would be in one grouping no matter how I chose to group repeats of a single Lyndon word. I think I chose to group them together, but it would make no difference in the final sort; it would affect how long it takes to sort, though. Also, Yuta's code was based on a mod to some fast linear-time Chinese method of doing BWT. I suspect, since you are in China, where there are better search engines, that you probably have access to more documents than anyone saddled with English-only searches, because stuff done here tends to be off limits to normal people unless you pay some site to see the work of others. I have heard that this is less of a problem in China, but I could be wrong. I would very much like it if you would show us here any C code for fast BWTS you ever come across, because most of us here are stuck with English only. I suspect, since China now has the best education system in the world, that most new work will most likely be done in your country anyway.

  #25 - RichSelian

    Quote Originally Posted by biject.bwts:
    I may have written and come up with BWTS, but it was Yuta who came up with the first linear-time and linear-space routine for doing BWTS. Also, it depends on which version I posted. You can view abababababab...ab as repeats of the same Lyndon word; if the code does, then this would sort very fast. However, in some of my codings I would grab this as one big word, so it would depend on which version I used. However, abababab..abc would be in one grouping no matter how I chose to group repeats of a single Lyndon word. I think I chose to group them together, but it would make no difference in the final sort; it would affect how long it takes to sort, though. Also, Yuta's code was based on a mod to some fast linear-time Chinese method of doing BWT. I suspect, since you are in China, where there are better search engines, that you probably have access to more documents than anyone saddled with English-only searches, because stuff done here tends to be off limits to normal people unless you pay some site to see the work of others. I have heard that this is less of a problem in China, but I could be wrong. I would very much like it if you would show us here any C code for fast BWTS you ever come across, because most of us here are stuck with English only. I suspect, since China now has the best education system in the world, that most new work will most likely be done in your country anyway.
    Here is my simple implementation, using the doubling algorithm to generate the suffix array. Generally it is a little slower than your old implementation, but there is no real "worst case", because its worst-case time complexity is O(n log n).
    I don't know whether there are bugs in the code, though, because I haven't written the decode procedure yet.
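
    The core of the doubling approach looks like this sketch (ordinary suffix sorting with qsort, so O(n log^2 n); a radix-sorted pass per round and the BWTS-specific Lyndon handling are not shown):
    Code:
    #include <stdlib.h>
    #include <string.h>

    static int n, gap;
    static int *rnk, *tmp2;

    /* compare suffixes i and j by their first 2*gap symbols */
    static int cmp(const void *a, const void *b)
    {
        int i = *(const int *)a, j = *(const int *)b;
        if (rnk[i] != rnk[j]) return rnk[i] - rnk[j];
        int ri = i + gap < n ? rnk[i + gap] : -1;
        int rj = j + gap < n ? rnk[j + gap] : -1;
        return ri - rj;
    }

    void build_sa(const unsigned char *s, int len, int *sa)
    {
        n = len;
        rnk  = malloc(n * sizeof *rnk);
        tmp2 = malloc(n * sizeof *tmp2);
        for (int i = 0; i < n; i++) { sa[i] = i; rnk[i] = s[i]; }
        for (gap = 1; ; gap *= 2) {
            qsort(sa, n, sizeof *sa, cmp);
            tmp2[sa[0]] = 0;
            for (int i = 1; i < n; i++)      /* rerank after each pass */
                tmp2[sa[i]] = tmp2[sa[i - 1]] + (cmp(&sa[i - 1], &sa[i]) < 0);
            memcpy(rnk, tmp2, n * sizeof *rnk);
            if (rnk[sa[n - 1]] == n - 1) break;   /* all ranks distinct */
        }
        free(rnk);
        free(tmp2);
    }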

  #26 - RichSelian

  #27 - Piotr Tarsa

    David, do you understand Yuta's divsufsort code? For me it's terribly unreadable and I haven't succeeded in deciphering it. From what I understand, he puts both SA and ISA in one 4N-sized table and does some magic to make that work.

  #28 - biject.bwts

    I downloaded and took a quick look at your code, just to see if it compiles, and it did. I am not sure what you want me to do with it. You have my reference code, so you should be able to tell if you get the same answer as mine on test files. In many ways the inverse should be easier to code, since in some ways the inverse looks simpler than a normal UNBWT.

  #29 - biject.bwts

    Quote Originally Posted by Piotr Tarsa:
    David, do you understand Yuta's divsufsort code? For me it's terribly unreadable and I haven't succeeded in deciphering it. From what I understand, he puts both SA and ISA in one 4N-sized table and does some magic to make that work.
    The fact is I haven't played with Yuta's BWT code. However, I did spend a lot of time looking at his code for BWTS, which I suspect may be similar to what he did for BWT, but I am not sure. Here is what he said to me one time: check out G. Nong, S. Zhang and W. H. Chan, "Linear Suffix Array Construction by Almost Pure Induced-Sorting". I did find something on the net about that, but like most papers I am not sure if it had much useful info for me. To me, Yuta's code itself is the key. I have thought about changing it so one could get a true BWT for output instead of the suffix-sorted array, which to me is not pure and seems to be what people are calling BWT nowadays.

    Actually, you just break the array into Lyndon words and rotate the whole array in question until it's one big Lyndon word, or a single Lyndon word repeated several times. The index is the number of characters of the array that you rotated, and the BWT is the BWTS of the rotated array. So it would not be hard to do if anyone actually cared about using the true BWT for the data. But just as it seems it's not worth the time to calculate the BWTS, it's just as likely not worth the time to calculate the real BWT when people seem happy with using the sorted-suffix-array thing.

    I think most do not like the way I explain things, and it would be foolish for me to explain in words without a blackboard or something, since what I would say would be not exactly correct. Maybe it's my lack of vocabulary. I liked old C and machine code. I like Yuta's style: he does not waste space with long names, he just gets down and codes things the way they should be. You can add plenty of writes to see what the values are. But his code is not wasteful and is much easier to follow than C++ code.

  #30 - Gribok

    Quote Originally Posted by Piotr Tarsa:
    David, do you understand Yuta's divsufsort code? For me it's terribly unreadable and I haven't succeeded in deciphering it. From what I understand, he puts both SA and ISA in one 4N-sized table and does some magic to make that work.
    You only need to sort 50% of the suffixes at most, so you can use the other memory to speed up sorting, e.g. for merge sort, or as an ISA buffer for tandem-repeat sorting in the worst case.
