I'm wondering what's a fast way to take input data and identify the most popular aligned "codewords" (say, of length 16, 32, or 64 bytes, and always aligned). An existing compression algorithm may very well already cover this, but in that case I'm interested in the core algorithm that could be extracted from it. Of course, sorting the data, counting duplicates, and then sorting again by frequency gives the exact answer, but it also processes the rare words and its complexity is too high.

I'd like to use this as a kind of preprocessing phase before a compressor, or to identify which codewords are common between data sets. Ideally I could choose to consider only those words which together constitute, say, 30% of the data set. The all-zero codeword alone may account for 5% to 15% of that, and the other popular words, though few in number, can make up a majority of the data. Example data sets are aligned executable data or maybe an aligned database. (Considering shifted strings that may not be aligned is another matter.)

Maybe some kind of probabilistic algorithm, where the popular words gradually make their way to the top of the charts, could work - I've tried to sketch what I mean below. Think of it as a fast histogram calculation that needn't give a perfect answer. Also, some *very* fast hash function may be needed; it wouldn't need to be perfect at all, but it seems necessary to condense 32 bytes into maybe 4 bytes. Perhaps even a subset of the codeword could be used as the hash input, as a further approximation. Any ideas on this would be appreciated.
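
To make the "top of the charts" idea concrete, here is a minimal sketch of what I have in mind, roughly in the spirit of the Misra-Gries / Space-Saving family of frequent-items algorithms: keep a small fixed table of candidate words with counters, bump a counter on a hit, and when the table is full, evict the weakest candidate. The table size, codeword length, and eviction rule below are placeholders I picked, not a worked-out design:

```c
#include <stdint.h>
#include <string.h>

#define WORD_LEN   32    /* codeword size in bytes (placeholder) */
#define TABLE_SIZE 1024  /* number of tracked candidates (placeholder) */

typedef struct {
    uint8_t  word[WORD_LEN];
    uint64_t count;
} Entry;

static Entry table[TABLE_SIZE];
static size_t used = 0;

/* Count a hit on a known word, or start tracking / evict when it's new. */
static void observe(const uint8_t *w)
{
    size_t i, min_i = 0;

    for (i = 0; i < used; i++) {
        if (memcmp(table[i].word, w, WORD_LEN) == 0) {
            table[i].count++;
            return;
        }
        if (table[i].count < table[min_i].count)
            min_i = i;
    }
    if (used < TABLE_SIZE) {          /* free slot: start tracking w */
        memcpy(table[used].word, w, WORD_LEN);
        table[used].count = 1;
        used++;
    } else {                          /* Space-Saving style: replace the minimum, */
        memcpy(table[min_i].word, w, WORD_LEN);
        table[min_i].count++;         /* inheriting (overestimating) its count */
    }
}

/* Feed every aligned codeword of a buffer to the sketch. */
static void scan(const uint8_t *data, size_t len)
{
    size_t off;
    for (off = 0; off + WORD_LEN <= len; off += WORD_LEN)
        observe(data + off);
}
```

The linear scan over the table is obviously too slow for a table of any real size; in practice the lookup would be a hash-table probe, which is where the fast hash comes in.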
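And for the hash itself, I imagine something as crude as folding the codeword with a few multiply-xor steps would do, since a collision only costs a little accuracy. A sketch (the multiplier constant is an arbitrary odd 64-bit value, not something tuned):

```c
#include <stdint.h>
#include <string.h>

/* Fold a 32-byte codeword into a 32-bit value with multiply-xor steps. */
static uint32_t fold_hash(const uint8_t *w)   /* w points at 32 bytes */
{
    uint64_t h = 0;
    size_t i;
    for (i = 0; i < 32; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, w + i, 8);             /* load 8 bytes at a time */
        h = (h ^ chunk) * 0x9E3779B97F4A7C15ull;  /* arbitrary odd constant */
    }
    return (uint32_t)(h >> 32);               /* the top bits mix best */
}
```

As the further approximation mentioned above, one could hash only the first 8 bytes of the word and verify the full 32 bytes against the table entry on a hit.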