Yesterday, 23:40
Uhm, Wikipedia says it's general-purpose LZ77 paired with Huffman coding and second-order context modelling, but looking at the code it also ships a small static general-purpose dictionary: https://github.com/google/brotli/blob/fcda9db7fd554ffb19c2410b9ada57cdabd19de5/c/common/dictionary.c
But I'm probably not the guy to ask.
Meanwhile, I made the counting linear, but it's way slower. I tried a few profiling tools but got nothing conclusive, just a little bit of everything - matching is expensive, sorting is expensive, and the lookups into the bitvector (which I implemented myself) are inefficient.
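For context, by bitvector I mean roughly this kind of word-packed layout - just a sketch, not my actual code, and the names are made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Word-packed bitvector: bits live in 64-bit words, so a membership test
// is one shift + mask, and counting can use popcount per word instead of
// testing every bit individually.
struct BitVector {
    std::vector<uint64_t> words;

    explicit BitVector(size_t n_bits) : words((n_bits + 63) / 64, 0) {}

    void set(size_t i) { words[i >> 6] |= uint64_t(1) << (i & 63); }

    bool get(size_t i) const { return (words[i >> 6] >> (i & 63)) & 1; }

    // Number of set bits in [0, end), one word at a time.
    // __builtin_popcountll is a GCC/Clang builtin; std::popcount from <bit>
    // works in C++20.
    size_t count_prefix(size_t end) const {
        size_t total = 0, full = end >> 6;
        for (size_t w = 0; w < full; ++w)
            total += __builtin_popcountll(words[w]);
        if (end & 63)
            total += __builtin_popcountll(words[full] & ((uint64_t(1) << (end & 63)) - 1));
        return total;
    }
};
```

If the slow part really is per-bit access, batching reads at word granularity like this is usually the first thing to try.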
I'm considering a DP approach where we only count the words once, but then updating the counts becomes quite an involved process, with multiple suffix array searches per update, and I'm still debating whether it's worth it.
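The suffix array search part is just the usual pair of binary searches for the range of suffixes that start with a word - a minimal sketch, assuming the suffix array is already built (construction not shown):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Count occurrences of `word` in `text`, given `sa` = positions of all
// suffixes of `text` sorted lexicographically.
size_t count_occurrences(const std::string& text,
                         const std::vector<size_t>& sa,
                         const std::string& word) {
    // First suffix whose |word|-length prefix is >= word.
    auto lo = std::lower_bound(sa.begin(), sa.end(), word,
        [&](size_t suffix, const std::string& w) {
            return text.compare(suffix, w.size(), w) < 0;
        });
    // First suffix whose |word|-length prefix is > word.
    auto hi = std::upper_bound(lo, sa.end(), word,
        [&](const std::string& w, size_t suffix) {
            return text.compare(suffix, w.size(), w) > 0;
        });
    return static_cast<size_t>(hi - lo);  // suffixes with `word` as prefix
}
```

Each count update in the DP scheme would presumably boil down to a few of these lookups, so it's O(|word| log n) per word rather than a rescan.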
Correct me if I'm wrong, but mcm seems to get away with the counting easily because it only ranks matches against the words in the lookahead window. Later it also uses a hashtable of strings, which would be quite inefficient in my case.
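If I'm reading the lookahead idea right, it's something like this - a hypothetical sketch of my interpretation, not mcm's actual code (the name, the alnum tokenization, and the frequency-only ranking are all just for illustration):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count and rank only the words that appear in the next `window` bytes,
// so the bookkeeping stays small no matter how big the whole input is.
std::vector<std::pair<std::string, int>>
rank_lookahead_words(const std::string& text, size_t pos, size_t window) {
    std::unordered_map<std::string, int> counts;
    size_t end = std::min(text.size(), pos + window);
    std::string cur;
    for (size_t i = pos; i < end; ++i) {
        if (std::isalnum(static_cast<unsigned char>(text[i]))) {
            cur.push_back(text[i]);
        } else if (!cur.empty()) {
            ++counts[cur];
            cur.clear();
        }
    }
    if (!cur.empty()) ++counts[cur];
    // Rank by frequency within the window only.
    std::vector<std::pair<std::string, int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return ranked;
}
```

Ranking by in-window frequency like this keeps the per-position work bounded by the window size.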