1. ## SCOTT TRANSFORM

I am calling it the Scott Transform since it's based on a BWTS, and few understand BWTS, which is necessary for understanding this transform. So it's not likely in current use.
I will write a 4-stream entropy coder later.

Basically it is a binary bijective transform that is LENGTH PRESERVING. Instead of getting your data in terms of long runs of zeroes and ones, you get unary numbers for the runs: 1111 or 0000 becomes 0001. You also get the unary numbers arranged in such a way that you could compress the whole series of numbers as one stream. Or you could compress alternate numbers, so that you would be compressing the underlying runs of ones and runs of zeroes as separate number streams. Or, lastly, you can look at the data as 4 separate streams based on what the First and Last columns sort to. You need only do a forward pass of the data for the entropy coder, and the streams break apart in a natural way. I will hopefully post a good 4-stream entropy coder as a separate program.

Here is the debug output of running it on a file containing "BANANAS":

C:\>scott_transD f bananas.x x.x
Doing the forward scott transform
Version 20110802
01000010 01000001 01001110 01000001 01001110 01000001 01010011
input file total = 56 group = 30 zeros = 36 ones = 20

11110000 00011100 00111000 01111000 10001000 11000000 00001100
normal BWTS total = 56 group = 16 zeros = 36 ones = 20

11100000 00111010 00111000 00111100 00001000 11000000 00011100
modied BWTS total = 56 group = 16 zeros = 36 ones = 20

THE 4 stream representation
IIIIOOOi iiOOOOOO OOOiiooo ooooIIIo IoooIIIo ooooIIII OOOioooo
00010010 01000000 00101000 00010011 10010010 00010001 00110000
total = 56 group = 25 zeros = 41 ones = 15

Like I said, it's bijective and length preserving. Please take a look at it.

2. ## Thanks:

R2F6K (7th February 2014)

3. So this is a bitwise BWTS followed by a delta transform (or binary MTF)? Interesting.

768,771 book1
442,501 book1.fpaq0p
768,771 book1.scott_trans
248,970 book1.scott_trans.fpaq0p

513,216 pic
67,909 pic.fpaq0p
513,216 pic.scott_trans
58,928 pic.scott_trans.fpaq0p

4. Well, yes, you could say that. However I was talking to <Shelwien> and he wishes me to explain more just what it is, so I am going to try in this post.
If you look at http://bijective.dogma.net/compresa.htm#iBBW you can see my code versus Nagy's for the binary one-zero thing. (Sadly it's from my DOS days, so none of the executables are likely to run on most machines. Also, I use a different C compiler now, MinGW, so they may not compile straight.) I really don't have a clue what he did, but I suspect he was just compressing a single string of numbers. On that page you can see I did a BWTS and an MTFQ and wrote a special entropy coder based on two streams. Well, I really don't like what I did. First of all, I converted to a string of zeroes and ones. This string is often not a multiple of 8 bits, since I have a way of mapping all 1-0 strings to bytes, and bytes are a multiple of 8 bits but a string can be any length. Second, the MTFQ will map to two sets of unary numbers depending on the first bit: it will use either 11110 or 00001 for runs, not both. Lastly, I am working in binary, so why use a BWTS with Lyndon words that could be all zeroes or all ones? I wish to use only Lyndon words that start with a zero and end in a one. Likewise I want all runs to be of the form 00001, with no runs of the form 11110; that really bothered me. So now fast forward to this transform.

I also want the choice of looking at either a single stream of numbers or two streams, where one holds the zero runs and the other the one runs. And what the hell, make it so I can use four streams of numbers; see below. Lastly, I don't want the length to change with the transform. So here it is.

First I temporarily add a zero in front of the string and a one in back; this makes the string 8*n+2 bits long. Then I do the standard binary BWTS (in debug mode I show what that is). Note the BWTS will only have Lyndon words that start with a zero and end in a one. Also of prime importance: in the 8*n+2 state there are exactly as many runs of ones as of zeros. This means the total number of runs is always even. This fact will be used later.

When you do the transform you magically have a one in front and a zero at the end. Since I added a zero and a one, I now take off a zero and a one, and I am back to the modified BWTS. Note that the total numbers of ones and zeros stay the same.

The next fact is that when you look at the First column you have only two runs: a run of zeros followed by a run of ones. Also notice that in the Last column, everywhere you have this condition FL being 01, there is a matching FL of 10 from the bottom.

The next trick is to realize that I can use the one in front and the zero in back of the 8*n+2 bits and still keep the same size, provided I drop two other bits somewhere inside the data set.

Since I know it starts with a 1, I encode the run of ones at the front: 1 goes to 1, or 111 goes to 001. When that is done I go to the back end of the file; noticing a run of zeros there, I encode it (000 goes to 001) and add this to the output. I then continue on the back side and encode the next set of ones. I keep doing this until I have seen enough zeros to account for the ones that occurred in front. I then switch to reading from the front of the file again, until there are more ones on top than zeroes on the bottom. I do this until I have exhausted 8*n bits. That leaves 2 bits somewhere in the center of the file, and they are dropped.

When you do the reverse transform you rebuild the 8*n+2 string. You can directly compute all the bits from the data; the trick is calculating the two dropped bits. One way is to look at how the unary file ended: either 0001 or 000. Remember that in unary mode there has to be an even number of runs, so if the count so far is odd, 0001[01] gives the missing bits. Note that in the 8*n+2 mode the last bit ends a unary number, so it's always a one. Then the bit next to it is either a one or a zero: whichever makes the total number of runs in the 8*n+2 string an even number. Hope that helps.
Running the D version shows the four separate streams in the output: I is a starting run of ones from the front, O is a starting run of zeros from the back, i is a run of ones next to an O from the back, and o is a run of zeroes from the front next to an I.

I chose the order to make it easy to use any of one, two or four streams. One could really make this transform much more complicated, but why do that unless you're using a highly modified version for some sort of encryption.

One last point: you really have several transforms here. I started at the front of the file using BWTS; you could start at the back. Also, since it's a string, you could scan the bytes in the opposite direction. You could also change the ones to zeroes and the zeroes to ones. I have not looked at that, but there are dozens of modes. You could customize it based on the data structure of the files in question if they have a common form.

Ok I hope this was good enough. I reread it several times. Sadly I see what I want to see and not what is there.

5. Originally Posted by Matt Mahoney
So this is a bitwise BWTS followed by a delta transform (or binary MTF)? Interesting.

768,771 book1
442,501 book1.fpaq0p
768,771 book1.scott_trans
248,970 book1.scott_trans.fpaq0p

513,216 pic
67,909 pic.fpaq0p
513,216 pic.scott_trans
58,928 pic.scott_trans.fpaq0p

Matt, I am glad you looked at it. I am glad it makes book1 compress better than fpaq0p by itself. I don't know how many symbol types are in book1, but I am guessing fewer than 128. I would be surprised if the Scott Transform, which is actually for bitstream compression, created fewer than 250 symbols, or maybe even 255. The fact that it compresses at all after this transform means that the symbols occurring most often after this transform are rich in zeroes: 00000000 may occur the most, along with bytes with only one or two bits set to one. Yet I don't think it would help in higher-order byte compression, since it's unlikely that the bytes are related to each other in a simple way. I think your PAQ compressor would do worse after this transform, since it's byte oriented and this is only going to confuse it.

6. Probably. An adaptive order 0 model like fpaq0p is better suited for BWT and I suppose binary BWTS. Anyway I included pic because it is bit oriented data (MSB first) rather than byte oriented.

7. no LTCB results yet then?

8. Originally Posted by willvarfar
no LTCB results yet then?
If you mean the Large Text Compression Benchmark, you have to realize the transform as written only uses 1,000,000-byte blocks, so you would be working on 1,000 separate blocks. I did write a 64-bit exe that does enwik9 for the 256-symbol case as one block; it worked on my PC, which has 6GB of memory. So it would not be hard to write a 64-bit exe to handle enwik8 in one block, and you could have a version that handles enwik9 with 10 blocks. I am playing around with a 4-stream entropy coder at this point and only looking at files of 1,000,000 bytes or less.
Besides, if you used one of the better programs from the LTCB, they would compress worse, since they are more byte-oriented than bit-oriented. Also, their code would not be aware of the 4 separate streams, so that info would not be used, causing worse compression. At least that is my gut instinct. But I would be happier if I was wrong.

9. It's a transform, not a compressor, so no LTCB benchmark yet. It still needs a back end model and coder.

10. Yes, it's a transform, not a compressor: it changes the length by ZERO.

However, using the worst compressor from the LTCB, namely arb2x,
which uses a single stream and has one cell that counts only zero bits and one bits,
you get the results below. I added these to Matt's test and also used my bijective 256-symbol coder arb255.

768,771 book1
763,152 book1.arb2x
435,125 book1.arb255
442,501 book1.fpaq0p
768,771 book1.scott_trans
248,970 book1.scott_trans.fpaq0p
367,326 book1.scott_trans.arb2x
279,327 book1.scott_trans.arb255

Not sure if I have the MinGW versions of arb.. out there, but
I will update in a few days and include the 2-stream and 4-stream
versions of arb2x for a quick test of the 4 streams.

I am really surprised that fpaq0p beat arb255 in this case.
arb255 is a true optimal stationary arithmetic compressor.
I am not surprised that fpaq0p beat it on the file book1, but the
fact that it beat it on book1.scott_trans by a larger margin is a
mystery to me.

By that I mean: take any byte permutation of the file and arb255
will compress it to the same size (the output could span a byte
boundary, so you could get lengths n and n+1 for some files). I
would be shocked if there was no permutation of book1.scott_trans
where fpaq0p does worse than arb255. I think fpaq0p is a tuned
nonstationary arithmetic compressor. Why this transform, which in
my mind is a spread-spectrum sort of thing, would allow this to
occur is very interesting, since it's not the sort of file that
the code was tuned for. Oh well, it makes life interesting.

11. Originally Posted by biject.bwts
I am glad it makes book1 compress better than fpaq0p by itself. I don't know how many symbol types are in book1, but I am guessing fewer than 128. I would be surprised if the Scott Transform, which is actually for bitstream compression, created fewer than 250 symbols, or maybe even 255. The fact that it compresses at all after this transform means that the symbols occurring most often after this transform are rich in zeroes: 00000000 may occur the most, along with bytes with only one or two bits set to one.
book1:
82 different bytes
entropy: 4.52
55.03% of zeroes / 44.97% of ones

book1.scott:
256 different bytes (550244 times 0x00 out of 768771, that's 71.57%)
entropy: 2.90
89.72% of zeroes / 10.28% of ones

12. For BWT you need a fast adapting model, not a stationary one. Here are a few more experiments.

Code:
```768,771 book1
768,771 book1.scott_trans

231,899 book1-m1.zpaq
250,286 book1.scott_trans-1.zpaq
252,387 book1.scott_trans.fpaq0f2

215,277 book1-m2.zpaq
237,694 book1.scott_trans-2.zpaq

277,515 book1.scott_trans.fpaq0
262,988 book1.scott_trans-cm1-255.zpaq

248,970 book1.scott_trans.fpaq0p
249,951 book1.scott_trans-cm1-8.zpaq```
book1-m1.zpaq is compressed with zpaq -m1. It uses BWT+RLE+ICM0. scott_trans-1 takes out the BWT pre/post but uses the same model. The config file is:
Code:
```comp 1 0 0 0 1
0 icm 5
hcomp
halt
post 0 end```
fpaq0f2 also uses an indirect model like this but with a different bit history.

zpaq -m2 uses BWT+ICM0-ISSE1 (no RLE). Again I replaced the BWT with scott_trans.
Code:
```comp 1 0 0 0 2
0 icm 5
1 isse 12 0
hcomp
d= 1 *d=0 hashd halt
post 0 end```
It uses an order 0 ICM like before, but then adjusts the prediction using an order 1 ISSE. An ISSE is a 2 input mixer taking the ICM prediction and a fixed constant as input and the weights are selected by an order 1 bit history.

fpaq0 is a stationary order 0 model. -cm1-255.zpaq is similar:
Code:
```comp 1 0 0 0 1
0 cm 9 255
hcomp
halt
post 0 end```
zpaq does not have a pure stationary model but this is pretty close. A CM maps a context directly to a prediction. The context is order 0, thus HCOMP is empty. But there is an implied bitwise context expanded to 9 bits to reduce cache misses. The arguments to the CM are the context size in bits and 1/4 the maximum count. The learning rate is 1/count where count is limited to 255*4, which is the maximum supported.

fpaq0p is an adaptive direct context model with a learning rate of 1/32. This is equivalent to "CM 9 8", replacing line 0 above in cm1-8.zpaq above except that the learning rate is faster until the count reaches 32.

13. Matt, when you say you replaced BWT with the scott_trans, are you replacing a BWT that sorts bytes with one that sorts bits? I don't really expect much gain if the code was designed for bytes. Here are results using lzw1_m32. It's a simple bijective 2-root LZW where the compression changes very slowly, since each pass replaces a single string of bits with 2 strings that are one bit longer. This is the MinGW version of LZW1, and it's designed to work with a string of bits. I think one needs a special string compressor that favors zeroes over ones out of the gate to get really small values, and this compressor does not do that, but it's interesting.

768,771 book1
768,771 book1.scott_trans
491,700 book1.lzw1_m32
244,769 book1.scott_trans.lzw1_m32

file bananas 7 characters "BANANAS"
7 bananas.scott_trans
9 bananas.lzw1_32
6 bananas.scott_trans.lzw1_m32

14. Originally Posted by biject.bwts
Matt, when you say you replaced BWT with the scott_trans, are you replacing a BWT that sorts bytes with one that sorts bits? I don't really expect much gain if the code was designed for bytes. Here are results using lzw1_m32. It's a simple bijective 2-root LZW where the compression changes very slowly, since each pass replaces a single string of bits with 2 strings that are one bit longer. This is the MinGW version of LZW1, and it's designed to work with a string of bits. I think one needs a special string compressor that favors zeroes over ones out of the gate to get really small values, and this compressor does not do that, but it's interesting.

768,771 book1
768,771 book1.scott_trans
491,700 book1.lzw1_m32
244,769 book1.scott_trans.lzw1_m32

file bananas 7 characters "BANANAS"
7 bananas.scott_trans
9 bananas.lzw1_32
6 bananas.scott_trans.lzw1_m32
MY MISTAKE, I SHOULD CUT AND PASTE:
8 bananas.lzw1_m32 (I meant to add one to the length and forgot the m before 32). I think I will have another fine German beer and call it a night; my eyes are tired.

15. Originally Posted by biject.bwts
lzw1_m32 Its a simple bijective 2 Root LZW where the compression changes very slowly since each pass replaces a string of bits with 2 strings that are one bit longer.
Shouldn't that be shorter rather than longer?

16. I'm using byte oriented BWT so I expect that to be better than bit oriented.

17. Originally Posted by caveman
Shouldn't that be shorter rather than longer?
Well, it depends on how you think of it. Here is another way to think about it. At the start you have two strings, "0" and "1". Suppose the file being compressed is 01...: you output a token, either a one or a zero, and then create two new strings, so you have 3 strings, 01, 00 and 1; you toss the 0 string in the trash bin and never use it again. Example: suppose you have 8 strings, 000 001 010 011 100 101 110 111. To output a token saying which string was used you need exactly 3 bits, so when this is the table you get no compression. But suppose the input file is a string of zeroes. Then you eventually get this working set of strings:
1 01 001 0001 00001 000001 0000001 0000000. In this case you still need to output 3 bits for any string. If the input is all zeroes, you output 3 bits for the current string of 7 zeros, then trash that string and create 2 new longer strings, one with 8 zeroes and the next with 7 zeroes followed by a 1. In fact you actually save output space when you have this set most of the time: you expand by 2 bits if the input was a 1, expand by 1 bit if it was 01, break even if the input was 001, and on everything else you are shortening the output file.

It's that simple; hope this helps.

The code for LZW1 was very simple. It can be made to compress this kind of zero-rich file better. The reason is that I used two-level Huffman, which is not as good as arithmetic. Example: suppose there are 6 strings in the tree. With two-level Huffman, 2 tokens get written in 2 bits and 4 get written with 3 bits. If each string is equally likely, then the average length out is (2*2 + 4*3)/6 = 2.66666667 bits on average; if you used arithmetic coding the average would be ln(6)/ln(2) = 2.5849625, so clearly in the general case arithmetic is better. However, if one is tuning LZW1 for a zero-rich environment, you should have the strings arranged and sorted so that those ending in 0 use the 2 bits and those ending in 1 use the 3 bits. This would greatly help compression when preceded by scott_trans. If I get the time I will mod LZW1 into a version tuned for just the 0-rich case.

18. Originally Posted by Matt Mahoney
I'm using byte oriented BWT so I expect that to be better than bit oriented.
Matt, your code was tuned with a byte-oriented BWT; if you substituted BWTS directly and then dropped the coding of the INDEX, it would likely compress much better than using the bit-oriented scott_trans. I think your original BWT would still beat byte BWTS on average, but not by much, since they are almost the same. But your stuff was tuned using the BWT that you use.
As for the scott_trans, it kind of says you can compress only if the result is rich in zeros, so the later stages would have to assume a zero-rich input. Just as I am sure I can make LZW1 better if I change it to expect more zeroes: it would then compress highly redundant files better. Of course, since any output is possible (it being a bijective transform), you could get a one-rich output file, but I think that would come from a random-looking input. Most random input files will produce random output files.

19. Originally Posted by biject.bwts
Well, it depends on how you think of it. Here is another way to think about it. At the start you have two strings, "0" and "1". Suppose the file being compressed is 01...: you output a token, either a one or a zero, and then create two new strings, so you have 3 strings, 01, 00 and 1; you toss the 0 string in the trash bin and never use it again. Example: suppose you have 8 strings, 000 001 010 011 100 101 110 111. To output a token saying which string was used you need exactly 3 bits, so when this is the table you get no compression. But suppose the input file is a string of zeroes. Then you eventually get this working set of strings:
1 01 001 0001 00001 000001 0000001 0000000. In this case you still need to output 3 bits for any string. If the input is all zeroes, you output 3 bits for the current string of 7 zeros, then trash that string and create 2 new longer strings, one with 8 zeroes and the next with 7 zeroes followed by a 1. In fact you actually save output space when you have this set most of the time: you expand by 2 bits if the input was a 1, expand by 1 bit if it was 01, break even if the input was 001, and on everything else you are shortening the output file.

Its that simple hope this helps
I get it; it's a bit like dynamic Tunstall codes (variable-length to fixed-length codes). For instance, if you have a probability of 0.653683 for zeros and 0.346317 for ones, 8-bit Tunstall codes look like these (codes [0-55] expand the output, codes [56-146] break even, codes [147-255] save between 1 and 4 bits). The Tunstall codes themselves could be output using an entropy coder, since their probabilities of occurrence are skewed (this would result in a variable-length to variable-length coder).
Code:
``` Tun  Size Probability Coded bits
0    5   0.0049816  11111
1    6   0.0032564  111101
2    6   0.0032564  111011
3    6   0.0032564  110111
4    6   0.0032564  101111
5    6   0.0032564  011111
6    7   0.0021286  1111001
7    7   0.0040179  1111000
8    7   0.0021286  1110101
9    7   0.0040179  1110100
10    7   0.0021286  1110011
11    7   0.0040179  1110010
12    7   0.0040179  1110001
13    7   0.0021286  1101101
14    7   0.0040179  1101100
15    7   0.0021286  1101011
16    7   0.0040179  1101010
17    7   0.0040179  1101001
18    7   0.0021286  1100111
19    7   0.0040179  1100110
20    7   0.0040179  1100101
21    7   0.0040179  1100011
22    7   0.0021286  1011101
23    7   0.0040179  1011100
24    7   0.0021286  1011011
25    7   0.0040179  1011010
26    7   0.0040179  1011001
27    7   0.0021286  1010111
28    7   0.0040179  1010110
29    7   0.0040179  1010101
30    7   0.0040179  1010011
31    7   0.0021286  1001111
32    7   0.0040179  1001110
33    7   0.0040179  1001101
34    7   0.0040179  1001011
35    7   0.0040179  1000111
36    7   0.0021286  0111101
37    7   0.0040179  0111100
38    7   0.0021286  0111011
39    7   0.0040179  0111010
40    7   0.0040179  0111001
41    7   0.0021286  0110111
42    7   0.0040179  0110110
43    7   0.0040179  0110101
44    7   0.0040179  0110011
45    7   0.0021286  0101111
46    7   0.0040179  0101110
47    7   0.0040179  0101101
48    7   0.0040179  0101011
49    7   0.0040179  0100111
50    7   0.0021286  0011111
51    7   0.0040179  0011110
52    7   0.0040179  0011101
53    7   0.0040179  0011011
54    7   0.0040179  0010111
55    7   0.0040179  0001111
56    8   0.0026264  11100001
57    8   0.0049574  11100000
58    8   0.0026264  11010001
59    8   0.0049574  11010000
60    8   0.0026264  11001001
61    8   0.0049574  11001000
62    8   0.0026264  11000101
63    8   0.0049574  11000100
64    8   0.0026264  11000011
65    8   0.0049574  11000010
66    8   0.0049574  11000001
67    8   0.0026264  10110001
68    8   0.0049574  10110000
69    8   0.0026264  10101001
70    8   0.0049574  10101000
71    8   0.0026264  10100101
72    8   0.0049574  10100100
73    8   0.0026264  10100011
74    8   0.0049574  10100010
75    8   0.0049574  10100001
76    8   0.0026264  10011001
77    8   0.0049574  10011000
78    8   0.0026264  10010101
79    8   0.0049574  10010100
80    8   0.0026264  10010011
81    8   0.0049574  10010010
82    8   0.0049574  10010001
83    8   0.0026264  10001101
84    8   0.0049574  10001100
85    8   0.0026264  10001011
86    8   0.0049574  10001010
87    8   0.0049574  10001001
88    8   0.0026264  10000111
89    8   0.0049574  10000110
90    8   0.0049574  10000101
91    8   0.0049574  10000011
92    8   0.0026264  01110001
93    8   0.0049574  01110000
94    8   0.0026264  01101001
95    8   0.0049574  01101000
96    8   0.0026264  01100101
97    8   0.0049574  01100100
98    8   0.0026264  01100011
99    8   0.0049574  01100010
100    8   0.0049574  01100001
101    8   0.0026264  01011001
102    8   0.0049574  01011000
103    8   0.0026264  01010101
104    8   0.0049574  01010100
105    8   0.0026264  01010011
106    8   0.0049574  01010010
107    8   0.0049574  01010001
108    8   0.0026264  01001101
109    8   0.0049574  01001100
110    8   0.0026264  01001011
111    8   0.0049574  01001010
112    8   0.0049574  01001001
113    8   0.0026264  01000111
114    8   0.0049574  01000110
115    8   0.0049574  01000101
116    8   0.0049574  01000011
117    8   0.0026264  00111001
118    8   0.0049574  00111000
119    8   0.0026264  00110101
120    8   0.0049574  00110100
121    8   0.0026264  00110011
122    8   0.0049574  00110010
123    8   0.0049574  00110001
124    8   0.0026264  00101101
125    8   0.0049574  00101100
126    8   0.0026264  00101011
127    8   0.0049574  00101010
128    8   0.0049574  00101001
129    8   0.0026264  00100111
130    8   0.0049574  00100110
131    8   0.0049574  00100101
132    8   0.0049574  00100011
133    8   0.0026264  00011101
134    8   0.0049574  00011100
135    8   0.0026264  00011011
136    8   0.0049574  00011010
137    8   0.0049574  00011001
138    8   0.0026264  00010111
139    8   0.0049574  00010110
140    8   0.0049574  00010101
141    8   0.0049574  00010011
142    8   0.0026264  00001111
143    8   0.0049574  00001110
144    8   0.0049574  00001101
145    8   0.0049574  00001011
146    8   0.0049574  00000111
147    9   0.0032406  110000001
148    9   0.0032406  101000001
149    9   0.0061167  101000000
150    9   0.0032406  100100001
151    9   0.0061167  100100000
152    9   0.0032406  100010001
153    9   0.0061167  100010000
154    9   0.0032406  100001001
155    9   0.0061167  100001000
156    9   0.0032406  100000101
157    9   0.0061167  100000100
158    9   0.0032406  100000011
159    9   0.0032406  011000001
160    9   0.0061167  011000000
161    9   0.0032406  010100001
162    9   0.0061167  010100000
163    9   0.0032406  010010001
164    9   0.0061167  010010000
165    9   0.0032406  010001001
166    9   0.0061167  010001000
167    9   0.0032406  010000101
168    9   0.0061167  010000100
169    9   0.0032406  010000011
170    9   0.0032406  001100001
171    9   0.0061167  001100000
172    9   0.0032406  001010001
173    9   0.0061167  001010000
174    9   0.0032406  001001001
175    9   0.0061167  001001000
176    9   0.0032406  001000101
177    9   0.0061167  001000100
178    9   0.0032406  001000011
179    9   0.0032406  000110001
180    9   0.0061167  000110000
181    9   0.0032406  000101001
182    9   0.0061167  000101000
183    9   0.0032406  000100101
184    9   0.0061167  000100100
185    9   0.0032406  000100011
186    9   0.0032406  000011001
187    9   0.0061167  000011000
188    9   0.0032406  000010101
189    9   0.0061167  000010100
190    9   0.0032406  000010011
191    9   0.0061167  000010010
192    9   0.0061167  000010001
193    9   0.0032406  000001101
194    9   0.0061167  000001100
195    9   0.0032406  000001011
196    9   0.0032406  000000111
197    9   0.0061167  000000011
198   10   0.0021183  1100000001
199   10   0.0039984  1100000000
200   10   0.0021183  1000000101
201   10   0.0039984  1000000100
202   10   0.0021183  1000000011
203   10   0.0039984  1000000010
204   10   0.0039984  1000000001
205   10   0.0021183  0100000101
206   10   0.0039984  0100000100
207   10   0.0021183  0100000011
208   10   0.0039984  0100000010
209   10   0.0039984  0100000001
210   10   0.0021183  0010000101
211   10   0.0039984  0010000100
212   10   0.0021183  0010000011
213   10   0.0039984  0010000010
214   10   0.0039984  0010000001
215   10   0.0021183  0001000101
216   10   0.0039984  0001000100
217   10   0.0021183  0001000011
218   10   0.0039984  0001000010
219   10   0.0039984  0001000001
220   10   0.0039984  0000100001
221   10   0.0021183  0000010101
222   10   0.0039984  0000010100
223   10   0.0021183  0000010011
224   10   0.0039984  0000010010
225   10   0.0039984  0000010001
226   10   0.0021183  0000001101
227   10   0.0039984  0000001100
228   10   0.0021183  0000001011
229   10   0.0039984  0000001010
230   10   0.0039984  0000001001
231   10   0.0039984  0000000101
232   10   0.0039984  0000000011
233   11   0.0026137  10000000001
234   11   0.0049334  10000000000
235   11   0.0026137  01000000001
236   11   0.0049334  01000000000
237   11   0.0026137  00100000001
238   11   0.0049334  00100000000
239   11   0.0026137  00010000001
240   11   0.0049334  00010000000
241   11   0.0026137  00001000001
242   11   0.0049334  00001000000
243   11   0.0026137  00000100001
244   11   0.0049334  00000100000
245   11   0.0026137  00000010001
246   11   0.0049334  00000010000
247   11   0.0026137  00000001001
248   11   0.0049334  00000001000
249   11   0.0026137  00000000101
250   11   0.0049334  00000000100
251   11   0.0026137  00000000011
252   11   0.0049334  00000000010
253   11   0.0049334  00000000001
254   12   0.0032249  000000000001
255   12   0.0060871  000000000000```

20. Originally Posted by caveman
I get it; it's a bit like dynamic Tunstall codes (variable-length to fixed-length codes). For instance, if you have a probability of 0.653683 for zeros and 0.346317 for ones, 8-bit Tunstall codes look like these (codes [0-55] expand the output, codes [56-146] break even, codes [147-255] save between 1 and 4 bits). The Tunstall codes themselves could be output using an entropy coder, since their probabilities of occurrence are skewed (this would result in a variable-length to variable-length coder).
Code:
``` Tun  Size Probability Coded bits
0    5   0.0049816  11111
1    6   0.0032564  111101
2    6   0.0032564  111011
3    6   0.0032564  110111
4    6   0.0032564  101111
5    6   0.0032564  011111
......
252   11   0.0049334  00000000010
253   11   0.0049334  00000000001
254   12   0.0032249  000000000001
255   12   0.0060871  000000000000```
Well, I see what you mean, but no, it's not the same. For example, with your table I guess you are assuming the input comes in single bytes, so you have 8 bits in, and for output you have this fixed table where you output 5 to 12 bits. Unless you look at the ratio of bits seen so far and then make a new table; but that would be a lot of work.

Here is another example for LZW1 2-root. You look at the file as one big string. The first time, you have 2 input strings and 2 output strings, each one bit long, so in effect you are writing 1 of two possible strings. Then you read whatever amount of input is needed to reach the end of the tree, and output 1 of three: working in bits, you output a single bit for 1 of the 3 and 2 bits for each of the other two. You create a new table and then output 1 of four strings, each output 2 bits long.

You could think of the outputs as (1 of 2), then (1 of 3), then (1 of 4), and so on till the file ends. So looking at the number of bits out for the first 3 tokens, you have either 4 or 5 bits. On input, however, you scan first one bit and output a token, then scan one or two bits and output a token, then scan 1 up to 3 bits and output the third token.

Example: if the input is 000000, here are the input/output trees:
input output
0 0
1 1

next tree
input output
00 0
01 01
1 11

next tree
input output
000 00
001 01
01 10
1 11

So 0 00 000 goes to 0 0 00; you save 2 bits.

Next example: 0110.
first tree
input output
0 0
1 1
next tree
input output
00 0
01 01
1 11
next tree
00 00
01 01
10 10
11 11

So 0 1 10 becomes 0 11 10; with this sequence you get one bit longer.

The end of input and end of output are handled by my bijective bit I/O, since I treat the file as a string, which is not usually 8*n bits. Look at the code to see this: bit_byte.cpp and bit_byte.h, which are used in most of my code for I/O.

21. I recompiled scott_trans.c, changed m to 10,000,000 (10 million bytes, that is 80 million bits), and made it a 64-bit executable:

/* Get the file size. */
m = 10000000;
// m = 1000;
tT = (unsigned char *)malloc((size_t)(m * sizeof(unsigned char)));
// *** top of big loop

When I was cleaning up the comments I somehow dropped the other write and never tested it for multiple blocks. Below is the correct version:

} else {
    if (n == 0) {
        fwrite(V, sizeof(unsigned char), (size_t)m, stdout);
        fprintf(stderr, " \n Done \n");
        exit(0);
    } else {
        fwrite(V, sizeof(unsigned char), (size_t)m, stdout);
    }
}
m = n;

I used lzw1_m64.exe, which is just lzw1 with more space: simple, unchanged, unoptimized.

100,000,000 enwik8.scott_trans
26,784,142 enwik8.scott_trans.lzw1_64m
1,000,000,000 enwik9.scott_trans
232,638,564 enwik9.scott_trans.lzw1_64m

I will change lzw1 to favor zeros to get better results.

I know many don't like LZW, especially a 2-root one that only replaces a single string each time with two strings one bit longer, but even though it is coded slowly I think it does well. I may also see what happens when I code it with bijective arithmetic instead of the 2-level Huffman.

later

22. I have read your implementation at http://bijective.dogma.net/bwts.zip. I'm surprised that your implementation runs so fast even in the worst case. I also have my own implementation of BWTS, using an O(n log n) suffix array generating algorithm, but it runs too slow. Could you describe your sorting algorithm for me? (I mean the bounded_compare() function and the xx[] array; it's too hard for me to understand.)

Sorry for my poor English

23. Originally Posted by RichSelian

That's OK, my English is poor too. However, the code you're referring to is old; I use xx and x for lots of things. I think you will find Yuta's code in the many versions of OPENBWT at this site better. But I did use XX[j] to point to the next location in the string holding a different character than the jth character. When the value of j increases it means you're going left to right down the string; when it goes negative you are going back to the start of the Lyndon word. The code, depending on which version I left on the net, may not be the best. When you go back, you know that string is less than any string to the left of that point, so you can stop the compare. If you are on two different Lyndon words, then sooner or later, if matches keep occurring, the ending condition will occur. If you are on the same Lyndon word, I wrap a few times and then declare the left starting one the winner. I am not good at explaining, but I hope that helps.

24. Originally Posted by biject.bwts
I seem to understand what you say. So your implementation performs badly for the case "abababab..." with O(n^2 log n) time complexity, because the xx[] array cannot improve the speed there.

25. I may have written and come up with BWTS, but it was Yuta who came up with the first linear-time and linear-space routine for doing BWTS. Also, depending on which version I posted: you can view abababababab...ab as repeats of the same Lyndon word. If the code does, then this would sort very fast. However, in some of my codings I would grab this as one big word, so it would depend on which version I used. Either way, abababab..abc would be in one grouping no matter how I chose to group repeats of a single Lyndon word. I think I chose to group them together, but it would make no difference in the final sort; it would affect how long it takes to sort, though.

Also, Yuta's code was based on a modification of a fast linear-time Chinese method of doing BWT. I suspect that, being in China, where there are better search engines, you probably have access to more documents than anyone saddled with English-only searches, especially since stuff done here tends to be off limits to normal people unless you pay some site to see the work of others. I have heard that it is less of a problem in China, but I could be wrong. I would like it very much if you ever come across C code to do fast BWTS and show it to us here, because most of us here are stuck with English only. I suspect that since China now has the best education system in the world, most new work will most likely be done in your country anyway.

26. Originally Posted by biject.bwts
Here is my simple implementation, using the doubling algorithm to generate the suffix array. Generally it is a little slower than your old implementation, but there is no "worst case" because its worst-case time complexity is O(n log n).
I don't know whether there are bugs in the code, though, because I haven't written the decode procedure.

27. David, do you understand Yuta's divsufsort code? For me it's terribly unreadable and I haven't succeeded in deciphering it. From what I understand, he puts both SA and ISA in one 4N-sized table and does some magic to make that work.

28. Originally Posted by RichSelian
I downloaded and took a quick look at your code just to see if it compiles, and it did. I am not sure what you want me to do with it. You have my reference code, so you should be able to tell whether you get the same answers on test files as mine. In many ways the inverse should be easier to code, since in some ways the inverse looks simpler than the normal UNBWT.

29. Originally Posted by Piotr Tarsa
The fact is I haven't played with Yuta's BWT code. However, I did spend a lot of time looking at his code in BWTS, which I suspect may be similar to what he did for BWT, but I am not sure. Here is what he said to me one time: check out G. Nong, S. Zhang and W. H. Chan, "Linear Suffix Array Construction by Almost Pure Induced-Sorting". I did find something on the net about that, but like most papers I am not sure it had much useful info for me. To me, Yuta's code itself is the key. I have thought about changing it so one could get a true BWT for output instead of the suffix-sorted array, which to me is not pure and seems to be what people are calling BWT nowadays.

Actually, you just break the array into Lyndon words and rotate the whole array in question until it is one big Lyndon word, or a single Lyndon word repeated several times. The index is the number of characters of the array that you rotated, and the BWT is the BWTS of the rotated array. So it would not be hard to do if anyone actually cared about using the true BWT for the data. But just as it seems not worth the time to calculate the BWTS, it is just as likely not worth the time to calculate the real BWT when people seem happy with using the sorted-suffix-array thing.

I think most do not like the way I explain things, and it would be foolish for me to explain in words without a blackboard or something, since what I would say would not be exactly correct. Maybe it's my lack of vocabulary. I liked old C and machine code. I like Yuta's style: he does not waste space with long names, he just gets down and codes things the way they should be. You can add plenty of writes to see what the values are. But his code is not wasteful and is much easier to follow than C++ code.

30. Originally Posted by Piotr Tarsa
You only need to sort 50% of the suffixes at most, so you can use the remaining memory to speed up sorting, e.g. for a merge sort, or as an ISA buffer for repeated-tandem sorting in the worst case.
