Results 1 to 3 of 3

Thread: Fast 32-byte aligned word histogram calculation

  1. #1
    Join Date
    Jul 2013
    Stanford, California
    Thanked 3 Times in 3 Posts

    Fast 32-byte aligned word histogram calculation

    I'm wondering what's a fast way to take input data and identify the more popular aligned "codewords" (say, for example of length 16, 32, or 64 bytes - and always aligned). An existing compression algorithm may very well already cover this, but in that case I'm interested in the core algorithm that can be extracted from it. And of course, sorting the data, counting duplicates, and then sorting again by frequency will give the perfect answer, but that also processes rare words, and has too high of complexity. I'd like to consider this as a kind of preprocess phase before a compressor or I'd like to identify which codewords are common between data sets. I'd like to be able to choose to consider only those words which, together, constitute 30% of the data set, for example. The all-zero codeword may already constitute 5% to 15% of that, and other popular words, though small in number, can also constitute a majority of the data. An example data set is aligned executable data or maybe an aligned database. (Considering shifting strings that may not be aligned is another matter). Maybe some kind of probabilistic algorithm where the popular words gradually make their way to the top of the charts can be considered. This may be considered to be like a fast histogram calculation that needn't give a perfect answer. Also, some *very* fast hash function may be needed; although it wouldn't need to be perfect at all, it seems necessary to condense 32 bytes into 4 bytes perhaps. Maybe even a subset of the codeword can be used for the hash as a further approximation. Any ideas on this would be appreciated.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Thanked 691 Times in 374 Posts
    crc or universal hashing

  3. #3
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Melbourne, Florida, USA
    Thanked 798 Times in 489 Posts
    If you know that your data has a repeating structure of fixed sized records then you can use the record offset as context without computing a histogram. zpaq -method 6 will usually figure out the record size and model the appropriate contexts. If not, you can specify a model directly. For example, instead of a normal order 0..4 ICM-ISSE chain like -method x0,6ci1,1,1,1m you could include a (offset mod 32) context like -method x0,6c0,32i1,1,1,1m where the initial order 0 ICM is replaced with order 0 plus offset mod 32.

Similar Threads

  1. Fast CRC table construction and rolling CRC hash calculation
    By Bulat Ziganshin in forum Data Compression
    Replies: 10
    Last Post: 27th May 2018, 04:32
  2. Traffic compression (1500 byte chunks)
    By blackhaz in forum Data Compression
    Replies: 4
    Last Post: 21st January 2012, 22:44
  3. Byte oriented LZ77 compressor
    By Matt Mahoney in forum Data Compression
    Replies: 21
    Last Post: 30th December 2011, 18:27
  4. word lists breaking English words into syllables
    By willvarfar in forum The Off-Topic Lounge
    Replies: 2
    Last Post: 27th September 2010, 01:41

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts