Page 1 of 2 (Results 1 to 30 of 39)

Thread: Burrows-Wheeler sort routine

  1. #1 - Member, Italy (joined May 2015)

    Burrows-Wheeler sort routine

    There is a strong chance this is a quite naive post, anyway...
    Looking at compressor sources that involve the BWT, I have always wondered why,
    instead of a miscellany of routines, they do not use a simple merge sort, so I wrote
    an optimized version that handles worst cases such as runs of consecutive identical
    characters. I hope it can be useful.

    Two hints:

    1) if the length of the array to be sorted is a power of 2, the nrm parameter should be
    set to True, otherwise to False;
    2) if you need a stable method, change the line

    ELSE IF a[m-1]<=a[i] THEN

    into

    ELSE IF a[m-1]<a[i] THEN

    Code:
    PROGRAM Sort;
    CONST
      len=13;
    VAR
      a:ARRAY[0..2*len-1] OF Byte;
      i,j:Word;
      src:Boolean;
    
    PROCEDURE MergeSort(VAR src:Boolean;nrm:Boolean);
    VAR
      dbl,sub,i,j,k,l,m,n,dst,lim:Word;
      x,y,z:Integer;
    BEGIN
      dbl:=len SHL 1;
      sub:=1;
      WHILE sub<len DO
        BEGIN
          i:=0;
          j:=sub;
          IF src THEN
            BEGIN
              Inc(i,len);
              Inc(j,len);
              k:=dbl;
              dst:=0;
            END
          ELSE
            BEGIN
              k:=len;
              dst:=len;
            END;
          n:=sub SHL 1;
          WHILE i<k DO
            BEGIN
              l:=i+sub;
              m:=j+sub;
              lim:=dst+n;
              IF nrm THEN
                BEGIN
                  x:=sub;
                  y:=sub;
                  z:=n;
                END
              ELSE
                BEGIN
                  IF src THEN
                    BEGIN
                      IF l>dbl THEN
                        l:=dbl;
                      IF m>dbl THEN
                        m:=dbl;
                      IF lim>len THEN
                        lim:=len;
                    END
                  ELSE
                    BEGIN
                      IF l>len THEN
                        l:=len;
                      IF m>len THEN
                        m:=len;
                      IF lim>dbl THEN
                        lim:=dbl;
                    END;
                  x:=l-i;
                  y:=m-j;
                  z:=m-i;
                END;
              (* Guard: read a[j] only when the second run is non-empty *)
              IF (j>=m) OR (a[l-1]<=a[j]) THEN
                BEGIN
                  IF z>0 THEN
                    Move(a[i],a[dst],z);
                  i:=l;
                  j:=m;
                  dst:=lim;
                END
              (* Use < only here to get a stable method *)
              ELSE IF a[m-1]<=a[i] THEN
                BEGIN
                  IF y>0 THEN
                    Move(a[j],a[dst],y)
                  ELSE
                    y:=0;
                  IF x>0 THEN
                    Move(a[i],a[dst+y],x);
                  i:=l;
                  j:=m;
                  dst:=lim;
                END;
              WHILE dst<lim DO
                BEGIN
                  (* Read a[i] and a[j] only while they are in range;
                     relies on short-circuit boolean evaluation *)
                  IF (j>=m) OR ((i<l) AND (a[i]<=a[j])) THEN
                    BEGIN
                      a[dst]:=a[i];
                      Inc(i);
                    END
                  ELSE
                    BEGIN
                      a[dst]:=a[j];
                      Inc(j);
                    END;
                  Inc(dst);
                END;
              Inc(i,sub);
              Inc(j,sub);
            END;
          sub:=n;
          src:=NOT src;
        END;
    END;
    
    BEGIN
      Randomize;
      FOR i:=0 TO len-1 DO
        a[i]:=Random(256);
      FOR i:=0 TO len-1 DO
        Write(' ',a[i]);
      WriteLn;
      src:=False;
      MergeSort(src,False);
      IF NOT src THEN
        j:=0
      ELSE
        j:=len;
      FOR i:=j TO j+len-1 DO
        Write(' ',a[i]);
    END.
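For readers outside Pascal, the same idea can be sketched in Python: a bottom-up merge sort that ping-pongs between two buffers and, like the routine above, replaces the merge with a single block copy when the two runs are already in order. This is an illustration of the idea, not a line-by-line port (the nrm/power-of-two special case is omitted).

```python
def merge_sort(a):
    """Bottom-up merge sort, ping-ponging between two buffers (a sketch)."""
    n = len(a)
    src, dst = list(a), [0] * n
    width = 1
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            # Fast path: the two runs are already ordered -> one block copy.
            if mid == hi or src[mid - 1] <= src[mid]:
                dst[lo:hi] = src[lo:hi]
                continue
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                if src[i] <= src[j]:      # '<=' keeps the sort stable
                    dst[k] = src[i]; i += 1
                else:
                    dst[k] = src[j]; j += 1
                k += 1
            dst[k:hi] = src[i:mid] if i < mid else src[j:hi]
        src, dst = dst, src
        width *= 2
    return src
```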
    Last edited by Marco_B; 23rd August 2016 at 14:32.

  2. #2 - Member, San Diego (joined Feb 2013)
    All that does is sort the array. That isn't the BWT.

  3. #3 - Member, Germany (joined Mar 2015)
    The reason there are special algorithms for building the suffix array / Burrows-Wheeler transform is that merge sort takes O(n^2 log n) steps in the worst case, while building the suffix array via e.g. induced sorting takes only O(n) steps. This means that for large amounts of data merge sort becomes incredibly slow, while induced sorting only needs time proportional to the size of the text.
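The cost gap is easy to see in a naive comparison-based suffix sort: the sort performs O(n log n) comparisons, but each comparison may scan O(n) characters. A throwaway Python sketch (illustrative only, and deliberately naive):

```python
def naive_suffix_array(s):
    """Suffix array by comparison sorting.

    O(n log n) comparisons, but each comparison can scan O(n) characters,
    so the worst case is O(n^2 log n), e.g. on "aaaa...a" where every pair
    of suffixes shares a long common prefix.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For "banana" this yields the familiar order a < ana < anana < banana < na < nana.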

  4. #4 - Member, Italy (joined May 2015)
    To msg #2:
    Absolutely yes. I am sorry if the post made it seem otherwise.

  5. #5 - Member, Italy (joined May 2015)
    To msg #3:
    I do not know enough about suffix array construction to judge its time complexity. But in ordering an array, if I am not wrong, the worst case of merge sort is O(n log n), without the quadratic term in n. Anyway, in my algorithm the lines from 68 to 88, if the array is already sorted or consists of a sequence of repeated characters, take 3/2*n comparisons and n fast moves.
    Last edited by Marco_B; 19th October 2015 at 10:59.

  6. #6 - Member, Germany (joined Mar 2015)
    You are right that sorting an array takes O(n log n) comparisons using merge sort, but comparing two strings of length n may take up to O(n) time. This means it takes O(n * log(n) * n) time to sort the suffix array using merge sort. Modern SACAs (suffix array construction algorithms) exploit the fact that they only compare rotations of the same string to achieve optimal O(n) time and O(n) space.

  7. #7 - Member, Italy (joined May 2015)
    To msg #6:
    Thank you for the explanation. Then, in a SACA, is it possible to use a merge sort or any other algorithm as a back end?
    Or is the construction phase strictly coupled with the sorting?

  8. #8 - Member, Germany (joined Mar 2015)
    In most SACAs I know, some type of radix sort is used to get O(n) time. E.g. putting LMS substrings into their buckets when using induced sorting (see "Linear Suffix Array Construction by Almost Pure Induced-Sorting" by Nong, Zhang and Chan) or using least-significant-digit radix sort in KS ("Simple Linear Work Suffix Array Construction" by Kärkkäinen and Sanders). The result is usually used to reduce the problem to a string of length O(n / 2) which is then solved recursively. After solving the reduced problem it is possible to solve the complete problem in O(n) time. This results in O(n + n/2 + n/4 + n/8 + ...) = O(2n) = O(n) worst case time.
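The linear-time sorting step both papers rely on is ordinary least-significant-digit radix sort over small fixed-width keys. A generic sketch of the technique (not code from either paper; the names are mine):

```python
def radix_sort_tuples(items, width, alphabet_size):
    """LSD radix sort of equal-width integer tuples.

    Runs in O(width * (n + alphabet_size)) time: one stable bucket pass per
    digit, from least significant to most significant. This is the kind of
    linear-time sorting step SACAs such as KS build on (a sketch of the
    general technique).
    """
    for pos in range(width - 1, -1, -1):
        buckets = [[] for _ in range(alphabet_size)]
        for t in items:
            buckets[t[pos]].append(t)          # stable pass on one digit
        items = [t for b in buckets for t in b]
    return items
```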
    Last edited by Christoph Diegelmann; 19th October 2015 at 17:47. Reason: Spelling

  9. #9 - Member, Italy (joined May 2015)
    Thank you again for pointing me in the right direction, and I am sorry for my laziness.

  10. #10 - Member, Germany (joined Mar 2015)
    No problem.

    Currently the fastest implementations are actually O(n log n); see libdivsufsort (https://github.com/y-256/libdivsufsort/) or msufsort (http://www.michael-maniscalco.com/index.htm). Yuta Mori's sais-lite (https://sites.google.com/site/yuta256/sais) is an actual O(n) suffix sort but is usually a little bit slower.

  11. #11 - Matt Mahoney, Expert, Melbourne, Florida, USA (joined May 2008)
    I have a summary of BWT and various suffix sorting algorithms at http://mattmahoney.net/dc/dce.html#Section_55

    Fast suffix sorting is complex (and I don't understand all of it), but the general idea is that once you determine the order of two long identical strings, you don't have to compare the suffixes of those strings because they are all in the same order. But how to apply this knowledge isn't obvious. Burrows and Wheeler first described their (also non-obvious) transform in 1994, but it wasn't until 2003 or so that people started figuring out how to suffix-sort highly redundant data efficiently. Early implementations like bzip2 and stuffit used tricks to avoid long matching strings. bzip2 applies RLE prior to BWT. Early stuffit implementations would flip bits to break up long matches. Both of these hurt compression slightly, and RLE isn't always effective anyway. Some programs eliminate long matches with LZP or LZ77 before BWT.

    Divsufsort is fast and open source (MIT license), so I just use that in zpaq instead of writing my own code.
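Ignoring speed entirely, the transform itself fits in a few lines once you have a suffix order: append a unique smallest sentinel, sort the suffixes, and emit the character preceding each one. A Python sketch for clarity (nothing like what divsufsort actually does internally):

```python
def bwt(s):
    """BWT via a naive suffix sort over s plus a sentinel (a sketch).

    With a unique smallest terminator appended, sorting rotations and
    sorting suffixes give the same order, so the BWT output is simply the
    character preceding each sorted suffix (Python's s[-1] conveniently
    wraps around to the sentinel for the first suffix).
    """
    s = s + "\0"                                       # unique, smallest
    sa = sorted(range(len(s)), key=lambda i: s[i:])    # naive suffix sort
    return "".join(s[i - 1] for i in sa)
```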

  12. #12 - Member, San Diego (joined Feb 2013)
    Quote Originally Posted by Matt Mahoney View Post
    Fast suffix sorting is complex (and I don't understand all of it), but the general idea is that once you determine the order of two long identical strings, then you don't have to compare the suffixes of those strings because they are all in the same order.
    It seems like it might be better to go backwards, because if a and b are adjacent in the sorted order, then xa and xb are also adjacent. I had an idea for how to sort that way, but I haven't implemented it. I think divsufsort is quite good and would be hard to beat.

  13. #13 - Member, Germany (joined Mar 2015)
    @nburns I haven't read much about KS (because from benchmarks it seemed to be slower) but SAIS is based on this effect.

    I don't know how to explain it correctly without going into too much detail, but basically SAIS selects a special subset of suffixes,
    sorts them, and is then able to sort all the other suffixes in constant time per suffix using this information (in the backwards direction, like you said).
    If anyone is really interested in it I could try to write a small article, but I doubt it would be any better than simply reading the papers.
    Last edited by Christoph Diegelmann; 20th October 2015 at 19:27.

  14. Thanks:

    nburns (21st October 2015)

  15. #14 - Member, Italy (joined May 2015)
    To msgs #11, #12:
    I think what Mahoney reported and nburns quoted is the key concept, thank you for clarifying it for me. If I understand correctly, the central computational problem is to find the common prefix among the strings.

  16. #15 - Matt Mahoney, Expert, Melbourne, Florida, USA (joined May 2008)
    A while back I compared SAIS and Divsufsort (both written by Yuta Mori) in zpaq and Divsufsort was faster in my tests.

  17. #16 - Member, San Diego (joined Feb 2013)
    Quote Originally Posted by Christoph Diegelmann View Post
    @nburns I haven't read much about KS (because from benchmarks it seemed to be slower) but SAIS is based on this effect.
    I have read a little bit about SAIS, so I probably got the idea from there.

  18. #17 - just a worm, Member, planet "earth" (joined Aug 2013)
    Quote Originally Posted by Matt Mahoney
    Fast suffix sorting is complex (and I don't understand all of it), but the general idea is that once you determine the order of two long identical strings, then you don't have to compare the suffixes of those strings because they are all in the same order. But how to apply this knowledge isn't obvious.
    That's exactly where I fail. The general idea is understandable, but how to use this knowledge to write an implementation is a puzzle to me. The longer the common prefix/suffix is, the slower my implementation is compared to those of Mr. Malyshev, Mr. Mori, Mr. Maniscalco and others who understood how to use the redundancy. Can someone explain how to use the redundancy to improve the sorting speed in a string like this:

    ...somelongcommonprefixA...somelongcommonprefixZ...

    Thank you.

    From what I understand so far, there seems to be a good way to integrate it into multikey quicksort. But I have no idea how.
    Last edited by just a worm; 21st October 2015 at 20:10.

  19. #18 - michael maniscalco, Programmer, Boston, Massachusetts, USA (joined Apr 2007)
    Quote Originally Posted by just a worm View Post
    That's exactly where I fail. The general idea is understandable, but how to use this knowledge to write an implementation is a puzzle to me. The longer the common prefix/suffix is, the slower my implementation is compared to those of Mr. Malyshev, Mr. Mori, Mr. Maniscalco and others who understood how to use the redundancy. Can someone explain how to use the redundancy to improve the sorting speed in a string like this:

    ...somelongcommonprefixA...somelongcommonprefixZ...

    Thank you.

    From what I understand so far, there seems to be a good way to integrate it into multikey quicksort. But I have no idea how.

    Dmitry's solution borrows from Manzini's 'Deep/Shallow' techniques if I remember correctly. I can't really speak to this solution but I can describe how MSufSort works and, to the extent where Yuta's work borrows from MSufSort, I can speak about that as well.

    There are three key parts to the algorithm:

    1. The 'improved two stage' which Yuta documented here: http://homepage3.nifty.com/wpage/software/itssort.txt (Introduced by Yuta Mori)
    2. Induction sorting, typically referred to as 'induced sorting'. It appears in many algorithms and has been a core feature of MSufSort since the first version in the late 1990s.
    3. The tandem repeat sort. Introduced in MSufSort 2.0 and perfected in 3.0.

    #3 is the most important part. Without it there is no way to avoid catastrophic cases for these direct comparison suffix sorting algorithms. This is one of the reasons for my Gauntlet Corpus, found here: http://www.michael-maniscalco.com/download.htm.

    There are also a collection of other improvements which more likely fall under the category of 'better engineering' rather than 'better algorithms' but I'll describe those as well. Some of them are:

    4. First stage ITS directly to BWT (rather than completing the second stage of the ITS first). Introduced with MSufSort
    5. Cache friendly second stage ITS (also used in #4)
    6. Three pivot quicksort (alternatively, there is also Tarsa's two pivot quicksort, if you like. But I've found the three pivot to have an advantage when used for multikey quicksort.)
    7. There are probably many others but I'll stick with these for now.

    So, to describe these key features:

    #1 - The "Improved Two Stage". I'll leave you to read Yuta's description as provided in the link. His description is very concise and I doubt I could explain it more clearly.

    #2 - "Induction Sorting". The idea of 'induced sorting' has been around for a long time and appears to be a key feature in the new linear algorithms. It's been in MSufSort since the beginning and is a key feature.
    The concept is to exploit the fact that suffix sorting is sorting substrings rather than independent strings. Here's an example:

    SORTSORTSORT

    Let's assume, for simplicity, that we sort these 12 suffixes in lexicographical order. That is, we sort the suffixes which begin with 'O' first, then those which begin with 'R' and then 'S' and then 'T'. If we sort the 'O' suffixes before we deal with any other suffixes then, after sorting them we would have their final sorted order available to us when we begin sorting the 'R' suffixes next. So, after sorting the 'O' suffixes we would have a partial suffix array as follows

    SORTSORTSORT
    -2---1---0--

    Then we begin sorting the 'R' suffixes, and we see that all three 'R' suffixes are similar for the first three characters, 'RTS'. But the key point is that we never need to compare these three suffixes at 'O' because we have already established the final sorted order of all 'O' suffixes. So, rather than blindly continuing to compare the 'R' suffixes directly, once they encounter 'O' suffixes we can switch to comparing the sorted order of the 'O' suffixes. It doesn't matter how long the common portion of the 'O' suffixes is, because we compare against the established sorted order of the 'O' suffixes and not the 'O' suffixes directly. Therefore, the 'R' suffixes will be completely sorted after establishing the common prefix 'RTS' and then comparing the values 0, 1, and 2. This is called 'induction sorting' in MSufSort. Induction sorting is pretty much the same concept as the 'induced sorting' that #1, the "Improved Two Stage", is doing. However, it is more cache friendly and quite easy to fold into a multikey quicksort.
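As an aside, the fall-back to known ranks can be written as a toy comparison routine. This is a sketch of the idea only; the function name and rank table are illustrative, not MSufSort's actual code:

```python
def induced_compare(s, i, j, rank):
    """Compare suffixes i and j of s, but stop scanning as soon as both
    positions land on suffixes whose final order is already known (in
    `rank`), and decide by rank instead. Returns -1, or 1.

    Toy sketch of 'induction sorting': with SORTSORTSORT and the 'O'
    suffixes already ranked, the 'R' suffixes never scan past 'RTS'.
    """
    while i < len(s) and j < len(s):
        if s[i] != s[j]:
            return -1 if s[i] < s[j] else 1
        i += 1
        j += 1
        if i in rank and j in rank:
            # Both tails are fully sorted suffixes: decide by their ranks.
            return -1 if rank[i] < rank[j] else 1
    # One suffix ran out: the shorter one sorts first (implicit sentinel).
    return -1 if i >= len(s) else 1
```

With s = "SORTSORTSORT" and the 'O' ranks {1: 2, 5: 1, 9: 0} from the diagram above, the 'R' suffixes at 10, 6, 2 come out in that order after scanning only three characters each.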

    #3 - the "Tandem Repeat" sort. This is the key invention introduced with MSufSort 2 that makes MSufSort (and eventually DivSufSort) superior to the 'doubling' algorithms of the same era. Tandem repeats are a repeat of characters side by side in the string. The repeat can be of any length and a 'run' of a single symbol is simply a tandem repeat with a length of one. Examples of tandem repeats are: 'AAAAAAAA' and 'ABCDABCDABCD'. In the simple case all suffixes in a tandem repeat can be sorted according to the sorted order of the last two suffixes in the repeat and, in this regard, is similar to induction sorting. For any tandem repeat of length N there are N suffixes denoted as N[i] with i ranging from 0 to N-1. So the tandem repeat 'BBBB' would be defined as:

    BBBB <- tandem repeat
    0123 <- N[i] with i = 0-3

    The key here is to sort the terminating repeats only. If repeat [N - 1] sorts as lexicographically 'less' than [N - 2] then the sorted order of all repeats will appear as N-1,N-2,N-3, ... ,0. Example: 'BBBBA' will be sorted as '3,2,1,0' for the B suffixes. However, if repeat [N - 1] sorts as lexicographically 'greater' than [N - 2] then the sorted order of all tandem repeats will be the opposite. Example: 'BBBBC' will be sorted as '0,1,2,3' for the B suffixes.

    So, it is important to recognize when a series of similar suffixes are also a series of tandem repeats, as early as possible. At that point, you switch to the tandem repeat sort and sort only the trailing suffixes and, once sorted, you induce the sorted order of the other suffixes involved in the tandem repeat. This is true, even for tandem repeats which are a run of the same character such as 'AAAAAAAA' etc.
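For the simplest case, a run of a single repeated character, the rule above can be written down directly: one comparison at the run's edge induces the order of every suffix inside the run. A toy sketch (the function name is mine, this is not MSufSort code):

```python
def run_suffix_order(s, start, end):
    """Order of the suffixes starting inside the run s[start:end) of one
    repeated character, induced from a single comparison at the run's edge.

    If the run is followed by something smaller than the run character (or
    by the end of the string), the later (shorter) suffixes sort first,
    as in 'BBBBA' -> 3,2,1,0; otherwise the earlier ones do, as in
    'BBBBC' -> 0,1,2,3. A sketch of the tandem-repeat idea for runs.
    """
    c = s[start]
    descending = end >= len(s) or s[end] < c
    idx = list(range(start, end))
    return idx[::-1] if descending else idx
```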

    The tandem repeat sort gets more complicated, however, when the string contains multiple, positionally disparate but otherwise identical tandem repeats. Example: 'ABCABCABC1...ABCABCABC2...ABCABCABC3...' Here we have nine 'A' suffixes which comprise three tandem repeats of three 'A' suffixes each. In this case, we need to isolate the trailing 'A' suffix in each tandem repeat and, before completing the tandem repeat sort for each tandem repeat, we must sort these trailing 'A' suffixes relative to each other first. Therefore, first we recognize the three tandem repeats for the 'A' suffixes as:

    ABCABCABC1...ABCABCABC2...ABCABCABC3...
    ......^............^............^......

    Then we sort these three 'A' suffixes as we normally would with any suffix array algorithm. And, once we have their sorted order relative to each other, we induce the sorted order of the other tandem repeats as interleaved across the otherwise positionally disparate tandem repeats. That is:

    ABCABCABC1...ABCABCABC2...ABCABCABC3...
    ......0............1............2...... <- 0,1,2 are actually directly sorted

    ABCABCABC1...ABCABCABC2...ABCABCABC3...
    ...3..0.........4..1.........5..2...... <- 3,4,5 are induced from 0,1,2

    ABCABCABC1...ABCABCABC2...ABCABCABC3...
    6..3..0......7..4..1......8..5..2...... <- 6,7,8 are induced from 3,4,5

    I'm sure this is not going to be clear when you first read this so think closely about what the above illustration shows and I think you will get it.

    I will go into points #4,5,6 at some future date. Again, those are engineering/architectural improvements rather than algorithmic improvements and they involve cache friendly optimizations rather than mathematically *better* algorithms. But I'm out of steam at the moment and don't have any more time at the moment.

    I hope this has helped. Note that I wrote this in a fixed-width font so that the various lines in the examples would line up in columns. If you see this in a variable-width font (highly likely) then the examples won't line up cleanly in columns and might not make sense.

    - Michael



    Last edited by michael maniscalco; 22nd October 2015 at 04:55.

  20. Thanks (4):

    Bulat Ziganshin (29th October 2015),hexagone (23rd October 2015),Matt Mahoney (22nd October 2015),Turtle (23rd October 2015)

  21. #19 - just a worm, Member, planet "earth" (joined Aug 2013)
    Well, I understood the issue, which was described pretty well by Mr. Hideo Itoh and Mr. Hozumi Tanaka (September 1999). It was improved later by Mr. Tsai-Hsing Kao (2001), which allows a more memory-efficient and faster way. Up to here I think those concepts are a step forward in sorting suffixes/prefixes/rotations. It can be more memory-efficient because Mr. Kao showed a way to guarantee that you only need to sort at most 1/2 of the suffixes/prefixes/rotations (with a slight change of the concept), which means you can use an implementation that needs only 2N bytes of memory for the suffix list instead of 4N. This still asks for a good trick when you complete the other suffixes, but the procedure only emits at most 1 new 4-byte position/suffix while going through the suffix list and discarding 1 position/suffix each iteration. Based on the work of Mr. Itoh and Mr. Tanaka, Mr. Yuta Mori later came up with the improved two-stage sort (2005), which partially uses the work of Mr. Kao, because Mr. Pang Ko and Mr. Srinivas Aluru use it (2003). But the memory requirement of Mr. Mori's improved two-stage sort is somewhat high in my opinion (2 * 256 * 256 * 4 bytes), and because of the time needed to initialize the 2 lists with zeros it is inefficient for small data.

    But those other 2 main points you listed (tandem repeats and induction sorting) are still difficult to implement (in my opinion). Are the sorting of tandem repeats and induction sorting 2 similar things? I think that even if I could distinguish the 2 techniques clearly, I would still have problems writing an implementation.

    Are there other ways to use the redundancy? I think it's sad to see good ideas that can't be adopted because they are not well documented or too complicated, even though the general idea is more or less comprehensible.
    Last edited by just a worm; 22nd October 2015 at 22:11.

  22. #20 - Member, Italy (joined May 2015)
    To msg #18:
    I found your explanation very useful and, incredibly, I feel I am starting to understand induced sorting.
    But thinking about a possible implementation I stumbled on the following point: the order of the
    suffixes of a string is not strictly equivalent to that of the rotations used by the BWT

    string
    CABB

    suffixes
    0) CABB
    1) ABB
    2) BB
    3) B

    rotations (BWT)
    0) CABB
    1) ABBC
    2) BBCA
    3) BCAB

    suffix order: 1, 3, 2, 0
    BWT order: 1, 2, 3, 0

    In the case of a repeated string such as AAAA, ordering from the last suffix allows inducing the sort by
    recognizing that each suffix incorporates the next, but for the BWT it is necessary to compare all the
    characters, so the running time is still quadratic. Where am I wrong?

  23. #21 - Member, Germany (joined Mar 2015)
    You can generate the BWT using a SACA in 2 ways:

    First way: add a single unique character, smaller than any other, to the end of the string. Then sorting the suffixes and sorting the rotations of this new string is the same.

    Second way (I'm using it in bce because it can only handle a binary alphabet): find the minimal string rotation (e.g. via Lyndon factorization) and sort that using a SACA. This generates exactly the same order as sorting all rotations of the original string, because a SACA assumes the (implicit) end of the string to be lexicographically smallest. If the first character of your string is the smallest and you sort rotations, this assumption is correct and you get the correct result.
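The first way is easy to check empirically with two naive sorts (a throwaway sketch, "\0" standing in for the unique smallest terminator):

```python
def suffix_order(s):
    """Naive suffix sort: indices of suffixes in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def rotation_order(s):
    """Naive rotation sort: indices of rotations in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
```

With the sentinel appended, e.g. t = "CABB" + "\0", the two orders coincide, because every comparison between rotations is already decided by the time it reaches the unique smallest character.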

  24. #22 - just a worm, Member, planet "earth" (joined Aug 2013)
    Quote Originally Posted by Christoph Diegelmann View Post
    Add a single unique character which is smaller than any other to the end of the string. Then sorting the suffixes and sorting the rotations of this new string is the same.
    I would like to add that this way is easy but increases the alphabet size by 1. Normally your alphabet has a size of 256 characters; with this way it has 257 characters, so you need more than 1 byte per character. A different solution does not need this additional memory: check every time whether you have reached the end of the string. But this needs CPU time.
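The end-of-string check can be folded into the comparison itself, so no stored sentinel (and no 257th symbol) is needed. A sketch of the idea; the function name is illustrative:

```python
def suffix_less(s, i, j):
    """True iff suffix i of s sorts before suffix j, behaving as if a
    unique smallest sentinel followed the text without storing one.

    The extra work is the bounds check on every step: that is the CPU-time
    cost traded against the extra alphabet symbol mentioned above.
    """
    while i < len(s) and j < len(s):
        if s[i] != s[j]:
            return s[i] < s[j]
        i += 1
        j += 1
    # Whichever suffix was exhausted (hit the virtual sentinel) is smaller.
    return i == len(s)
```

For example, with "ABCDABC" the suffix "ABC" at position 4 sorts before "ABCDABC" at position 0, exactly as "ABC$" would sort before "ABCDABC$".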
    Last edited by just a worm; 25th October 2015 at 05:33.

  25. #23 - michael maniscalco, Programmer, Boston, Massachusetts, USA (joined Apr 2007)
    Quote Originally Posted by just a worm View Post
    I would like to add that this way is easy but increases the alphabet size by 1. Normally your alphabet has a size of 256 characters; with this way it has 257 characters, so you need more than 1 byte per character. A different solution does not need this additional memory by checking every time whether you have reached the end of the string. But this needs CPU time.
    Dealing with the sentinel is trivial, and avoiding additional memory during suffix array construction is also fairly simple. Dealing with string compares which include the sentinel will vary from SACA to SACA depending on the algorithm, but I've always found it to have no impact on performance. Certainly no more so than handling wrap-around would be.

    Adding the sentinel at the end of the input is absolutely the best way to go.
    Last edited by michael maniscalco; 24th October 2015 at 20:27.

  26. #24 - just a worm, Member, planet "earth" (joined Aug 2013)
    Would you mind sharing how to avoid the additional memory requirement while not needing to check whether the comparison has reached the end-of-string character? If it depends on the algorithm, let's assume that we are sorting all suffixes with a simple insertion sort.

    How does the comparison work?
    ABC$
    ABCDABC$

    Thank you. This would be helpful.
    Last edited by just a worm; 25th October 2015 at 05:36.

  27. #25 - Member, Italy (joined May 2015)
    To msg #21:
    I would say it seems pure magic, but surely you will shut me up with a tremendous demonstration. Thank you again.

  28. #26 - Member, Germany (joined Mar 2015)
    Ok then let's take your example of CABB:

    Way 1: Add a unique minimal character to the end:

    CABB$

    Sort all its suffixes and all its rotations:

    Suffixes:
    $
    ABB$
    B$
    BB$
    CABB$

    Rotations:
    $CABB
    ABB$C
    B$CAB
    BB$CA
    CABB$

    As you can see, the rotations and suffixes are now in exactly the same order.

    Way 2: Rotate to the lexicographically minimal string rotation and assume a unique minimal character

    step 1: lexicographically minimal string rotation
    CABB -> ABBC

    step 2: sort

    Suffixes:
    ABBC
    BBC
    BC
    C

    Rotations:
    ABBC
    BBCA
    BCAB
    CABB

    As you can see, they are again in exactly the same order.
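Way 2 can also be checked with a quick script. The rotation search below is the naive O(n^2) version purely for illustration; Booth's algorithm or Lyndon factorization would find the minimal rotation in linear time. The check is for examples like this one (strings whose rotations are all distinct):

```python
def min_rotation(s):
    """Return (rotated string, index) of the lexicographically minimal
    rotation of s. Naive O(n^2) search, for demonstration only."""
    k = min(range(len(s)), key=lambda i: s[i:] + s[:i])
    return s[k:] + s[:k], k

def orders_match(s):
    """After rotating to the minimal rotation, the naive suffix order
    equals the naive rotation order (for strings with distinct rotations,
    as in the CABB example above)."""
    r, _ = min_rotation(s)
    suf = sorted(range(len(r)), key=lambda i: r[i:])
    rot = sorted(range(len(r)), key=lambda i: r[i:] + r[:i])
    return suf == rot
```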
    Last edited by Christoph Diegelmann; 26th October 2015 at 13:32. Reason: Tables don't work ...

  29. #27 - Member, Italy (joined May 2015)
    To msg #26:
    You win. Seriously, I am convinced now.

  30. #28 - Member, Germany (joined Mar 2015)
    Thanks. A small note about how to reverse the second version: the regular BWT offset will always be zero after the rotation, so you don't need to save it. But you need to save the position where the minimal string rotation started in the original string in order to reconstruct it.

  31. #29 - biject.bwts, Member, texas (joined Jun 2008)

    A small note:

    The result is nothing more than the BWTS of the rotated file. Except that if you did the BWTS of the original file you would not need the offset at all. When doing the Lyndon thing you're halfway there, so why rotate the file at all?
    The BWTS and your BWT are always the same for at least one rotation of the original file. The big advantage is that now you have a bijective transform instead of one where an index is required.

  32. #30 - Member, Germany (joined Mar 2015)
    I waited for you.

    While BWTS is good for generic BWT compression (awesome if you want to encrypt afterwards), it destroys some contexts, which might hurt compression for context-aware BWT compressors like bce or m03.
    Moreover, BWTS can't be used for FM-indices.


