In my case, we introduced an additional element that makes the code much faster than the original method. In the original algorithm you use only one dynamic compressed string, whose length equals the length of the text, N; it follows that the height of the trees used internally in the dynamic string is log N, so overall the algorithm runs in O(N log N) time. In our case, we instead partition the BWT by length-k contexts (k is the entropy order, approximately log N), and for each context we maintain a separate dynamic string. The number of contexts is very high (asymptotically N/polylog(N)), so on average (assuming uniformly random input text) each context has length polylog(N), and the height of the trees used internally in the dynamic strings is O(1) (constant). Overall, this leads to an average-case O(N) time algorithm instead of O(N log N). In practice this is confirmed: the algorithm is not as fast as SA-based algorithms, but it is considerably faster than algorithms that use a single dynamic string for the entire BWT.
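To make the partitioning idea concrete, here is a minimal sketch (not the actual implementation, which uses dynamic compressed strings rather than Python lists). It computes the BWT of a small text by naive rotation sorting and groups the BWT symbols by their length-k context, i.e. the first k characters of the corresponding sorted rotation. The point it illustrates is that the BWT splits into many short per-context strings, which is why the internal trees stay shallow; the function name `bwt_by_context` and the terminator `$` are my choices for this example.

```python
def bwt_by_context(text, k):
    """Naive BWT via rotation sorting, with the output symbols
    grouped by their length-k context (the first k characters of
    each sorted rotation). In the real algorithm, each group would
    be stored in its own small dynamic compressed string."""
    s = text + "$"                      # unique terminator, smaller than all symbols
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    groups = {}                         # context -> list of BWT symbols
    for i in order:
        rot = s[i:] + s[:i]
        ctx = rot[:k]                   # length-k context of this BWT position
        groups.setdefault(ctx, []).append(rot[-1])  # last char = BWT symbol
    return groups

groups = bwt_by_context("banana", 2)
# Concatenating the groups in sorted context order recovers the full BWT,
# yet each individual group is much shorter than the whole text.
bwt = "".join("".join(g) for _, g in sorted(groups.items()))
```

Here the BWT of "banana$" is split across five contexts, the largest holding only two symbols; with a longer text and k ≈ log N, each group has expected length polylog(N), so the per-group tree height is bounded as claimed above.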