Results 1 to 15 of 15

Thread: Counting byte freq., could this be vectorized?

  1. #1
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    16
    Thanks
    17
    Thanked 3 Times in 3 Posts

    Counting byte freq., could this be vectorized?

    Hi,

    I am looking for a way to speed up calculating byte frequency in a stream. Obvious is reading a byte and using it as index to a 0..255 array to increment a counter. In old times we used the xlat opcode

    I know the stuff about mmx,sse and avx. But opcodes repertories in modern intel are huge.

    Do anyone knows a specific opcode or set of them able to, for example, read 4 or 8 bytes to a register and optimize byte counting in parallel?

    Thanks guys!

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts
    Well, I did something: https://godbolt.org/g/8oNkcF
    But compiler actually thinks that its inefficient, and would use scalar code for it without that pragma.

    And really... what you're supposed to have vectors of?
    Reading 2^N bytes to a vector register at once - sure, that's easy.
    But then these bytes may be the same or different.
    And if they're the same, it means incrementing the same counter, so waiting for previous increments to be done.

    Then, its possible to do it like in my code.
    But again, there're no whole vectors, just elements from different vectors (to hide the dependency).

  3. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts
    More efficient vectorization can be done for this: https://godbolt.org/g/Lh9Yzz
    But I doubt that its faster.

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    avx2 vpgather required to retrieve previous counter value

    avx512 vpscatter required to store the updated value back

    both are pretty inefficient, but you can try. w/o vpscatter all you can do is to retrieve old counters with vpgather and then copy data into scalar registers for storing. It looks less efficient than load-load-inc-store sequence of scalar commands

    instead, I suggest you to look up efficient scalar implementations. AFAIK, with static freq[] array, they are as fast as 1 cycle/byte

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts
    Uh, gather/scatter don't solve the problem.
    Byte values can be equal, then increment + scatter won't work properly.

    But I wonder if there's some tricky implementation possible.
    For example, is it possible with binary alphabet (only 0 and 1 as data byte values)?

  6. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,553
    Thanks
    767
    Thanked 685 Times in 371 Posts
    I don't mentioned that since this problem is already solved in your code - you are using one array per lane. So, you can try to compile your code with avx512

    PSHUFB is essentially 8-bit load with 4-bit index, but then you need to store updated values, and get the same problem of conflicts

  7. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts
    Yes, but that's actually why this problem seemed interesting to me - its related to sorting.
    Like, there would be no update collisions if we had to work with sorted byte values.

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts
    Here's an idea: we can count 16-bit freqs instead, then collapse them back to 8-bit freqs.
    like
    for( i=0; i<len; i+=2 ) tbl[ (unsigned short&)data[i] ];

    Does it still count as SIMD? :)

    Collapsing 16-bit freqs to 8-bit also can be vectorized pretty good.
    But simply using 32-bit counters for freqs would use 256k of memory, which doesn't fit into L1 and might be slow because of that?
    So maybe we'd have to use 8-bit counters and stop on overflows?

  9. #9
    Member
    Join Date
    Aug 2015
    Location
    Urbana, IL
    Posts
    159
    Thanks
    10
    Thanked 162 Times in 90 Posts
    @Cyan wrote an article on this issue. In FSE, he is using an unrolled loop with several 8-bit tables. FSE also includes a benchmark program with several algorithms.

  10. Thanks:

    Shelwien (17th August 2018)

  11. #10
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    - Fastest Histogram Construction : scalar, sse, avx2 functions with benchmark
    - using
    _mm512_conflict_epi32 in AVX-512 seems not to be faster than the scalar functions.

  12. Thanks (2):

    anormal (25th September 2018),Shelwien (17th August 2018)

  13. #11
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    16
    Thanks
    17
    Thanked 3 Times in 3 Posts
    hey guys, thanks you very much for the info,
    I was thinking in computing 32bits freqs and then in some way collapse to 8bits as @shelwein said, but I thought the post-processing will make it slower

    I want to experiment with a, not related to compression directly but could be used, file classifier. I've read tons of paper about malware and file autoclassification, there are many ways of doing this, but one pretty easy, afaik, is a n-gram (in this case 1-gram), frequency compare via Naive Bayes classificator or something like that.
    My idea is related to emulation gaming, not malware or compressing, but file binary identification.

    thanks again guys!

    edit: tested TurboHist in 2 different machines, the fastest seems to be the 8_32 variant, using scalars!, great!

  14. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts

  15. Thanks:

    anormal (25th September 2018)

  16. #13
    Member
    Join Date
    May 2017
    Location
    Spain
    Posts
    16
    Thanks
    17
    Thanked 3 Times in 3 Posts
    @shelwien thanks, had a look but i need to read it better, regards!

  17. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,843
    Thanks
    288
    Thanked 1,245 Times in 698 Posts
    It attempts to detect different data types in a file based on frequency tables.

  18. Thanks:

    anormal (25th September 2018)

  19. #15
    Member
    Join Date
    Mar 2015
    Location
    Germany
    Posts
    57
    Thanks
    34
    Thanked 38 Times in 15 Posts
    With additional O(n) memory you should be able to count faster using AVX512: vcompressd can be used to implement a damn fast 1 bit radix sort for bytes. I stumbled upon this while writing code to build the wavelet matrix/tree for bce. My AVX2 code uses shuffle to do the same and can be found under https://github.com/akamiru/bce/blob/...ry/dict_avx2.h (it's performance doesn't pay of for counting only but AVX512 should be A LOT faster due to vcompressd and double the elements each cycle). Sadly I'm not sure where my optimized AVX512 version is if I find it I'll share it (but implementing it should be straight forward). After sorting you simply use a binary search to count.

  20. Thanks:

    anormal (25th September 2018)

Similar Threads

  1. LzTurbo Byte coding vs. Zhuff FSE
    By dnd in forum Data Compression
    Replies: 57
    Last Post: 24th January 2014, 03:25
  2. Traffic compression (1500 byte chunks)
    By blackhaz in forum Data Compression
    Replies: 4
    Last Post: 21st January 2012, 21:44
  3. Byte oriented LZ77 compressor
    By Matt Mahoney in forum Data Compression
    Replies: 21
    Last Post: 30th December 2011, 17:27
  4. Vectorized rangecoder
    By Shelwien in forum Data Compression
    Replies: 5
    Last Post: 16th January 2011, 18:32
  5. Counting 1 or 0?
    By osmanturan in forum Forum Archive
    Replies: 8
    Last Post: 21st January 2008, 20:47

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •