Thread: Fast 32-byte aligned word histogram calculation

1. Fast 32-byte aligned word histogram calculation

I'm wondering what's a fast way to take input data and identify the more popular aligned "codewords" (say, for example of length 16, 32, or 64 bytes - and always aligned). An existing compression algorithm may very well already cover this, but in that case I'm interested in the core algorithm that can be extracted from it. And of course, sorting the data, counting duplicates, and then sorting again by frequency will give the perfect answer, but that also processes rare words, and has too high of complexity. I'd like to consider this as a kind of preprocess phase before a compressor or I'd like to identify which codewords are common between data sets. I'd like to be able to choose to consider only those words which, together, constitute 30% of the data set, for example. The all-zero codeword may already constitute 5% to 15% of that, and other popular words, though small in number, can also constitute a majority of the data. An example data set is aligned executable data or maybe an aligned database. (Considering shifting strings that may not be aligned is another matter). Maybe some kind of probabilistic algorithm where the popular words gradually make their way to the top of the charts can be considered. This may be considered to be like a fast histogram calculation that needn't give a perfect answer. Also, some *very* fast hash function may be needed; although it wouldn't need to be perfect at all, it seems necessary to condense 32 bytes into 4 bytes perhaps. Maybe even a subset of the codeword can be used for the hash as a further approximation. Any ideas on this would be appreciated.

2. crc or universal hashing

3. If you know that your data has a repeating structure of fixed sized records then you can use the record offset as context without computing a histogram. zpaq -method 6 will usually figure out the record size and model the appropriate contexts. If not, you can specify a model directly. For example, instead of a normal order 0..4 ICM-ISSE chain like -method x0,6ci1,1,1,1m you could include a (offset mod 32) context like -method x0,6c0,32i1,1,1,1m where the initial order 0 ICM is replaced with order 0 plus offset mod 32.

Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•