It's "Data Parallel Finite-State Machines" by Mytkowicz, Musavathi, and Schulte
https://research.microsoft.com/pubs/...-mytkowicz.pdf
They argue that lots of (most?) FSM's are amenable to substantial (but not huge) speedup with SIMD processing, due to convergences in the state transition network limiting the amount of wasted speculative execution you have to do.
One of their examples is tokenizing HTML, for which they get a speedup of 2.3.
They also compare an optimized non-SIMD Huffman decode with a SIMD one, getting single-core speedups of about 2---from 300+ MB/sec to 600+ MB/sec on a Intel 2.67GHz Xeon (X5650). (16x bytes SIMD.)
I'm not qualified to evaluate this paper, but I thought it looked interesting in terms of both scanning strings and encoding.