A friend pointed me at this:
http://arxiv.org/abs/1307.1417
I've been playing with the sample. I've never seen it be faster than divsufsort, but on some files, it's close, especially larger ones, like enwik8. On some smaller files, it's multiple times slower. There are probably some optimizations in divsufsort that this does not have.
It looks promising.
I needed to make some changes to the source code to get it to compile and behave reasonably for testing. I've attached my version in case it helps anyone; a Mac OS X binary is included.
AFAIU, divsufsort cannot multithread. If this one algorithm can, it would be a great breakthrough.
I tried compiling both the original and modified code in Windows using make and g++. In both cases I get a RadixSA.exe which crashes when run. I haven't tried to debug it further.
Anyway, the paper says the algorithm can be run in parallel. But the "simple" version is IMHO worthless: it radix sorts on the first l characters into buckets, then uses another sorting algorithm within each bucket. That would work fine on random data (as they prove), but on real, highly compressible data the bucket distribution would be highly uneven. For example, on a file of all zero bytes, all of the suffixes would end up in one bucket.
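For illustration, here is a minimal toy sketch of that "simple" scheme (my own code, not the paper's; the function name and the choice of l are made up): one radix pass buckets the suffixes by their first l characters, then a comparison sort finishes each bucket. The all-zero-bytes failure mode is visible right in the code, since every suffix gets the same key.
Code:
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Toy version of "radix sort on the first l characters, then sort buckets".
std::vector<uint32_t> naiveBucketSuffixSort(const std::string& text, int l = 2) {
    const size_t n = text.size();
    // One bucket per possible l-byte prefix (l = 2 gives 65536 buckets).
    std::vector<std::vector<uint32_t>> buckets(size_t(1) << (8 * l));
    for (size_t i = 0; i < n; i++) {
        size_t key = 0;
        for (int j = 0; j < l; j++)  // zero-pad suffixes shorter than l
            key = (key << 8) | (i + j < n ? (unsigned char)text[i + j] : 0);
        buckets[key].push_back((uint32_t)i);
    }
    // Finish each bucket by comparing whole suffixes. On a file of all zero
    // bytes every suffix lands in a single bucket and each comparison scans
    // O(n) characters, so this degenerates badly on highly repetitive data.
    std::vector<uint32_t> sa;
    sa.reserve(n);
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end(), [&](uint32_t a, uint32_t c) {
            return text.compare(a, std::string::npos,
                                text, c, std::string::npos) < 0;
        });
        sa.insert(sa.end(), b.begin(), b.end());
    }
    return sa;
}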
Thanks. The binary works. I tested it on the artificial files in http://encode.su/threads/1711-Testin...ll=1#post32943
All of the 1 MB files ran in less than 0.3 seconds (2.0 GHz T3200). On the file w8002 (2 copies of a 50 MB random string), process time was 45 seconds and task manager shows 1 GB memory used (10x input size).
Edit: One file gives an error:
d6016 is a 1 MB file containing the repeating 16-byte sequence (00 10 20 30 40 50 60 70 80 90 a0 b0 c0 d0 e0 f0).
Code:
radixSA.exe d6016 d6016.sa
Read 1000000 bytes from file d6016
Computing Suffix Array...
      2 [main] radixSA 4404 exception::handle: Exception: STATUS_ACCESS_VIOLATION
    719 [main] radixSA 4404 open_stackdumpfile: Dumping stack trace to radixSA.exe.stackdump
I think the issue with building on Windows has to do with building a 32-bit binary. Cygwin seems to give you a 32-bit compiler by default.
I've seen random crashes on a few files as well, so that's probably normal behavior for this code.
I've noticed that it's faster than divsufsort on DNA and protein, which matches the results in the paper. I haven't seen it beat divsufsort on anything else.
Someone should create a competition for suffix sorting, along the lines of http://sortbenchmark.org/. The state of the art doesn't seem to have moved much in the past few years. Suffix sorting is inherently useful, and it's an interesting problem.
libdivsufsort will multi-thread if you compile with -fopenmp
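For anyone who wants to try it, here is a minimal sketch of calling libdivsufsort directly (divsufsort() from divsufsort.h is the library's entry point; the multithreading only applies if the library itself was built with OpenMP support):
Code:
#include <divsufsort.h>
#include <cstdio>
#include <vector>

int main() {
    const char* text = "mississippi";
    const saidx_t n = 11;
    std::vector<saidx_t> sa(n);
    // Returns 0 on success; link with -ldivsufsort.
    if (divsufsort((const sauchar_t*)text, sa.data(), n) != 0) {
        std::fprintf(stderr, "divsufsort failed\n");
        return 1;
    }
    // Print each suffix array entry and the suffix it points to.
    for (saidx_t i = 0; i < n; i++)
        std::printf("%d: %s\n", (int)sa[i], text + sa[i]);
    return 0;
}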
On enwik8 it's slightly slower than divsufsort for me. There doesn't appear to be support for multithreading in the sample program. I guess the parallel version is meant to be an exercise for the reader.
Code:
~/src$ radixSA enwik8 SA
Read 100000000 bytes from file enwik8
Computing Suffix Array...
RadixSA took [11.62s]
Writing output to file SA...
~/src$ mksary enwik8 SA
enwik8: 100000000 bytes ... 11.0841 sec
~/src$
As to this:
MSufSort is fully parallel for both first and second stage, but the work isn't complete yet because there aren't enough hours in a day.
I would then say MSufSort is parallelizable, not parallel :]
The catch is here:
Furthermore, a large set of SACAs are collected in the jSuffixArrays library [27] under a unified interface. This library contains Java implementations of: DivSufSort [28], QsufSort [25], SAIS [11], skew [9] and DeepShallow [24]. We include them in the comparison with the note that these Java algorithms may incur a performance penalty compared to their C counterparts.
So, basically, they're comparing C++ to Java. That's a good trick.
I've skimmed through the paper and it looks somewhat ridiculous. Most of the technical talk (i.e. besides the comparisons to others and the presentation of experimental results) is about radix sort, which everyone understands, and I think a lot of people know that radix sort does very well on random inputs. What they propose is:
- radix sort on the first log2 n bits,
- repetition handling,
- relaxed prefix doubling (i.e. no longer true prefix doubling, but similar in spirit; the classic version is sketched below for reference).
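For contrast, here is the classic (non-relaxed) prefix doubling in its textbook O(n log^2 n) form; this is my own sketch for reference, not the paper's variant:
Code:
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Classic Manber-Myers-style prefix doubling: sort suffixes by their first
// k characters, double k, and re-rank until every bucket is a singleton.
std::vector<int> prefixDoublingSA(const std::string& s) {
    const int n = (int)s.size();
    std::vector<int> sa(n), rank(n), tmp(n);
    if (n == 0) return sa;
    std::iota(sa.begin(), sa.end(), 0);
    for (int i = 0; i < n; i++) rank[i] = (unsigned char)s[i];
    for (int k = 1; ; k <<= 1) {
        // Order by (rank of first k chars, rank of the next k chars);
        // -1 makes a shorter suffix sort before its extensions.
        auto cmp = [&](int a, int b) {
            if (rank[a] != rank[b]) return rank[a] < rank[b];
            int ra = a + k < n ? rank[a + k] : -1;
            int rb = b + k < n ? rank[b + k] : -1;
            return ra < rb;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        // Re-rank: suffixes that still compare equal share a rank, so
        // singleton buckets correspond to fully sorted positions.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;  // all ranks distinct: done
    }
    return sa;
}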
They also didn't say what should be parallelized. Instead we can read in the paper:
The for loop traverses the suffixes from the last to the first position in the text. This order ensures that after each step, suffix i will be found in a singleton bucket.
So the main for loop is not parallelizable as a whole, but instead we can try to parallelize the small steps within it. It seems somewhat doubtful to me that those small steps would parallelize well.
The devil is in the implementation details, I think. They talk about optimizations related to the processor cache, and probably that's where most of the performance comes from.
So for me there isn't any new useful theory here that wasn't already present in the papers describing the tested SACAs, and the algorithm is only interesting if it performs relatively well; if it does, then we can learn which optimizations bring a healthy gain in practice.