# Schindler Transform (STX)

• 15th July 2011, 00:27
CUDALIKE
Here is my question.
Suppose we have two sequences
1: AAABAA
2: AABAAA

Use ST2 (STX):
And from both the 1st and the 2nd string we get the same sorted columns:
AA
AA
AA
AA
AB
BA

Different inputs give the same result.
Where am I wrong?
• 15th July 2011, 00:51
Shelwien
With STn you still sort all the cyclic shifts of the whole string, like in BWT, and still output the last column.
The difference is that you only include n symbols in the comparisons.

Also, in STn you'd still need an EOF symbol or an index position, like in BWT.
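As a sketch of that description (a hypothetical helper, assuming one common tie-breaking convention: a stable sort that keeps ties in shift order), an STn forward transform might look like:

```python
def st_forward(s, n):
    # STn: sort all cyclic shifts of s, comparing only their first n
    # symbols (stable sort, so ties keep shift order); output the last
    # column plus the row index of the unshifted string.
    size = len(s)
    # shift k moves the last k symbols to the front
    shifts = [s[size - k:] + s[:size - k] for k in range(size)]
    order = sorted(range(size), key=lambda k: shifts[k][:n])
    last = ''.join(shifts[k][-1] for k in order)
    return last, order.index(0)
```

With this convention, st_forward("AAABAA", 2) gives ("AABAAA", 0) and st_forward("AABAAA", 2) gives ("AAABAA", 0): the last columns of the two strings differ, so the transform stays invertible.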
• 15th July 2011, 12:57
CUDALIKE
1: AAABAA
2: AABAAA

Use ST2 (STX):
1.
shift matrix:
AAABAA <- index0
AAAABA
AAAAAB
BAAAAA
ABAAAA
AABAAA

after sort:
AAABAA <- index0
AAAABA
AAAAAB
AABAAA
ABAAAA
BAAAAA

Result : AAAABA
index0: 0

2.
shift matrix:
AABAAA <- index0
AAABAA
AAAABA
AAAAAB
BAAAAA
ABAAAA

after sort:
AABAAA <- index0
AAABAA
AAAABA
AAAAAB
ABAAAA
BAAAAA

Result : AAAABA <- equal to 1
index0: 0

Or should I take not the second but the last column as the result?
For 1: AABAAA
For 2: AAABAA
• 15th July 2011, 13:11
Piotr Tarsa
You have to take the last column.
• 15th July 2011, 13:17
Shelwien
Well, there's a different (and much simpler) way to build ST2.
Just count the occurrences of symbol pairs, then allocate a buffer for each pair, then put
the symbols encountered in context of that pair into that buffer.

Something like this:
Code:

```
// assumed globals: CNUM = 256 (alphabet size);
// o1: count/offset table with at least CNUM+1 entries;
// o2: count/offset table with at least CNUM*CNUM+1 entries

uint Forward_ST2( byte* inpbuf, byte* outbuf, uint inplen ) {
  uint i,c,p,q;
  memset( o2, 0, sizeof(o2) );
  // count each symbol's context pair (p,q) = two preceding symbols, cyclically
  p = inpbuf[inplen-2];
  q = inpbuf[inplen-1];
  for( i=0; i<inplen; i++ ) {
    c = inpbuf[i];
    o2[1+p+(q<<8)]++;
    p = q; q = c;
  }
  // turn counts into output offsets; the first two output bytes hold the seed pair
  for( i=0,c=2; i<CNUM*CNUM; i++ ) o2[i]=c, c+=o2[i+1];
  p = inpbuf[inplen-2]; outbuf[0] = p;
  q = inpbuf[inplen-1]; outbuf[1] = q;
  // place each symbol into the bucket of its context pair
  for( i=0; i<inplen; i++ ) {
    c = outbuf[ o2[p+(q<<8)]++ ] = inpbuf[i];
    p = q; q = c;
  }
  return inplen+2;
}

uint Inverse_ST2( byte* inpbuf, byte* outbuf, uint inplen ) {
  uint i,j,c,p,q;
  memset( o1, 0, sizeof(o1) );
  memset( o2, 0, sizeof(o2) );
  for( i=2; i<inplen; i++ ) c=inpbuf[i], o1[1+c]++; // symbol freqs
  for( i=0,c=2; i<CNUM; i++ ) o1[i]=c, c+=o1[i+1]; o1[CNUM]=c; // now offsets
  for( j=0; j<CNUM; j++ ) { // symbol loop: rebuild the pair counts
    q = j;
    for( i=o1[j]; i<o1[j+1]; i++ ) {
      c = inpbuf[i]<<8;
      o2[1+q+c]++; // byte freqs
    }
  }
  for( i=0,c=2; i<CNUM*CNUM; i++ ) o2[i]=c, c+=o2[i+1];
  p = inpbuf[0];
  q = inpbuf[1];
  for( i=0; i<inplen-2; i++ ) {
    c = outbuf[i] = inpbuf[ o2[p+(q<<8)]++ ];
    p = q; q = c;
  }
  return inplen-2;
}
```
• 15th July 2011, 22:14
biject.bwts
With something as simple as AABAAA and AAABAA, why not use the BWTS transform? Then you don't need a stupid INDEX.

AABAAA > AAB AAA, two Lyndon words; the strings of interest
are AAA AAA AAA AAB ABA BAA, which sort to
AAA
AAA
AAA
AAB
ABA
BAA

giving AAABAA as the BWTS of AABAAA

while
AAABAA > AAAB AA, two Lyndon words, with strings
AA AA AAAB AABA ABAA BAAA
which sort to
AAA  (note: the short strings are repeated to match the length of the longest Lyndon word)
AAA
AAAB
AABA
ABAA
BAAA  (use the last symbol of each string of interest; for the length-2
Lyndon words that means the second symbol)

giving AABAAA as the BWTS of AAABAA, and again with no need for an INDEX
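The recipe above can be sketched in Python. One caveat on this sketch: Duval's factorization splits the trailing runs of A into single-letter Lyndon words rather than "AAA"/"AA", but the rotations sort the same way; short rotations are compared as periodic repetitions, which matches the "repeat string" note above.

```python
def lyndon_factors(s):
    # Duval's algorithm: factor s into a non-increasing sequence of Lyndon words
    factors, i = [], 0
    while i < len(s):
        j, k = i + 1, i
        while j < len(s) and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

def bwts(s):
    # all rotations of every Lyndon factor, compared as periodic strings
    rots = [(w, i) for w in lyndon_factors(s) for i in range(len(w))]
    def key(r):
        w, i = r
        rot = w[i:] + w[:i]
        reps = 2 * len(s) // len(rot) + 1  # long enough for a periodic compare
        return (rot * reps)[:2 * len(s)]
    rots.sort(key=key)
    # output the last symbol of each sorted rotation
    return ''.join(w[i - 1] for w, i in rots)
```

This reproduces the worked examples: bwts("AABAAA") is "AAABAA" and bwts("AAABAA") is "AABAAA", with no index needed.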
• 1st August 2011, 02:48
CUDALIKE
Thanks, everyone.
biject.bwts: I do not fully understand this tricky algorithm. Is it a simple BWT?

Shelwien: thanks, this significantly speeds up ST2 decoding.
Could you also explain how the inverse transform differs for different X, i.e. how does ST8 decoding differ from ST2 decoding?
I also wonder about optimized versions in general.
• 1st August 2011, 23:48
Shelwien
You can see the ST3, ST4 and ST5 implementations in libbsc.
In short, ST4+ has to be much more complicated than ST2, because allocating 16G static arrays for it is still troublesome atm.
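The "16G" figure follows directly from the context-counting approach: a direct STn table needs one counter per n-symbol context, so it grows as 256^n. A quick check (assuming 4-byte counters, as a rough illustration):

```python
# size of a direct STn count/offset table: one counter per n-byte context
for n in (2, 3, 4):
    contexts = 256 ** n
    gib = contexts * 4 / 2 ** 30   # assuming 4-byte counters
    print(f"ST{n}: {contexts} contexts, {gib:g} GiB")
```

ST2 fits in 256 KiB, while ST4 already needs 16 GiB, which is why ST4+ switches to a different strategy.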
• 10th August 2011, 03:00
CUDALIKE
I looked at the STX decoding in bsc_st.cpp: it is not so difficult to implement, but it takes a long time to execute.
In effect, sorting the full-length contexts has been shifted from compression to decompression. But decompression is usually used much more often than compression, so overall we lose. Or is something wrong here?

And which should presumably be faster in total, encoding + decoding for STX or for BWT, i.e. which is faster: unSTX(STX) or unBWT(BWT)?
• 10th August 2011, 04:36
Gribok
Only the inverse ST3 transform is faster than BWT. Also, BWT can easily be optimized for multi-core systems, while ST cannot.
• 13th November 2011, 22:11
CUDALIKE
Quote:

Originally Posted by Gribok
Only the inverse ST3 transform is faster than BWT. Also, BWT can easily be optimized for multi-core systems, while ST cannot.

About BSC: why are there secondary indexes in BWT, and why are they not used?
I guess they exist just to parallelize the inverse BWT transform, right?

Quote:

Originally Posted by bwt.h
* @param indexes - the secondary indexes array, can be NULL.
int bsc_bwt_encode(unsigned char * T, int n, unsigned char * num_indexes, int * indexes, int features);

• 13th November 2011, 22:29
Gribok
Quote:

Originally Posted by CUDALIKE
About BSC: why are there secondary indexes in BWT, and why are they not used?
I guess they exist just to parallelize the inverse BWT transform, right?

They are used in parallel decompression and they are on by default.
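The idea can be sketched as follows (a hypothetical sketch, not bsc's actual format): the inverse BWT walks a successor table one symbol at a time, so if the encoder also records the matrix rows corresponding to a few known output offsets (the "secondary" indexes), each range between two offsets can be decoded independently by its own thread. The "banana" values below were worked out by hand for the sentinel-free rotation BWT.

```python
def inverse_bwt_ranges(last, starts):
    # last: BWT last column; starts: (row, output_offset) pairs, with the
    # primary index at offset 0 and any secondary indexes at later offsets.
    n = len(last)
    # stable argsort of the last column acts as the successor map between rows
    t = sorted(range(n), key=lambda i: last[i])
    out = [''] * n
    pts = sorted(starts, key=lambda x: x[1]) + [(None, n)]
    for (row, off), (_, end) in zip(pts, pts[1:]):
        p = row
        for j in range(off, end):   # each range is independent: parallelizable
            p = t[p]
            out[j] = last[p]
    return ''.join(out)

# BWT of "banana" (sorted rotations, no sentinel) is "nnbaaa", primary index 3;
# row 1 is the walk's state at output offset 3, so it works as a secondary index
print(inverse_bwt_ranges("nnbaaa", [(3, 0), (1, 3)]))  # -> banana
```

With only the primary index the same function decodes the whole string in one sequential pass; each extra index splits off another independently decodable range.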
• 14th November 2011, 19:06
CUDALIKE
Quote:

Originally Posted by Gribok
They are used in parallel decompression and they are on by default.

I think they could just as easily be used to speed up the inverse BWT on the GPU. Do you plan to do this?
• 15th November 2011, 02:07
Gribok
Quote:

Originally Posted by CUDALIKE
I think they could just as easily be used to speed up the inverse BWT on the GPU. Do you plan to do this?

Random memory access is slow on CUDA. It is not clear to me whether CUDA will be faster than the CPU, but I may try.
• 15th November 2011, 10:58
CUDALIKE
Quote:

Originally Posted by Gribok
Random memory access is slow on CUDA. It is not clear to me whether CUDA will be faster than the CPU, but I may try.

Yes. But there is the possibility of using all the cores: count the numbers of different characters across all cores and threads, and use shared memory to hold the intermediate data.
I think you could get a tenfold speedup or more over the CPU.
And even if the speedup is small, it would still share the load between the CPU and GPU.
• 28th November 2011, 23:40
CUDALIKE
Random memory access is slow on the CPU too.
RAM works well with sequential access; for random access it is the cache that adds speed.
A block larger than 1 MB can hardly use the L2 cache, so we can only rely on L3.
With -b10 .. -b100, the CPU will get almost no benefit from the cache at all.

http://www.1024cores.net/home/in-rus...ata-structures
http://www.1024cores.net/home/parall...ous-algorithms