Results 1 to 13 of 13

Thread: AVX2 nibble decoding (SIMD horizontal scan)

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts

    AVX2 nibble decoding (SIMD horizontal scan)

    I have 5 versions, any more ideas?

    https://godbolt.org/z/rq77vM

    uint v5( qword i ) {
    byte LUT[256];
    LUT[0xF2] = 0; LUT[0xA1] = 1; LUT[0xBE] = 2; LUT[0x2F] = 3;
    LUT[0x03] = 4; LUT[0xAF] = 5; LUT[0x52] = 6; LUT[0x76] = 7;
    LUT[0x51] = 8; LUT[0x99] = 9; LUT[0x80] = 10; LUT[0x2C] = 11;
    LUT[0x94] = 12; LUT[0xEA] = 13; LUT[0x9D] = 14; LUT[0x9E] = 15;

    const auto x1 = _mm256_load_si256( (const __m256i* __restrict)&IJ[i][0] );
    const auto y1 = _mm_set1_epi8( 3 );

    const auto x2 = _mm_sub_epi8(y1,_mm_packs_epi16( _mm256_castsi256_si128(x1), _mm256_extracti128_si256(x1,1) ));
    i = _mm_crc32_u64( _mm_cvtsi128_si64(x2), _mm_cvtsi128_si64(x2)+_mm_extract_epi64(x2,1) );
    i = LUT[byte(i)];
    return i;
    }
    Attached Files Attached Files

  2. Thanks (2):

    Mike (24th December 2020),xinix (24th December 2020)

  3. #2
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    976
    Thanks
    266
    Thanked 350 Times in 221 Posts
    Quote Originally Posted by Shelwien View Post
    I have 5 versions, any more ideas?
    Consider sharing the purpose, functional requirements and portability requirements of the function you are writing.

  4. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    > Consider sharing the purpose,

    Like this: https://github.com/nauful/NLZM/blob/.../NLZM.cpp#L401
    SIMD decoding of a nibble symbol, as thread title says.

    > functional requirements

    No branches, no loops, and some interesting new method to extract the information,
    like CRC trick in first post.
    If it could be somehow faster than movebits/tzcnt combination - would be the best, but unlikely.
    But I'm mostly interested in finding rare instructions that could be applied here.
    Intel SIMD is very asymmetric, so there're many unusual instructions.

    > and portability requirements of the function you are writing

    AVX2 is written in thread title. But AVX512 would be also okay, if its interesting.

  5. #4
    Member
    Join Date
    Nov 2013
    Location
    US
    Posts
    166
    Thanks
    39
    Thanked 54 Times in 30 Posts
    Not AVX, but the updated version of what I haven't completed yet:
    Code:
    // NUM_SSE_UPDATES  = (N + 7) / 8
    template<int N> inline uint32 cdf_lookup(CDF<N> &cdf, uint16 freq) {
    #ifdef USE_SSE
        __m128i _f = _mm_set1_epi16(freq + 1);
    
        __m128i _cmp0 = _mm_sub_epi16(cdf._cell[0], _f);
        __m128i _cmp1 = CDF<N>::NUM_SSE_UPDATES >= 2 ? _mm_sub_epi16(cdf._cell[1], _f) : _mm_setzero_si128();
        uint32 idx_mask = _mm_movemask_epi8(_mm_packs_epi16(_cmp0, _cmp1));
    
        if (CDF<N>::NUM_SSE_UPDATES >= 3) {
            __m128i _cmp2 = _mm_sub_epi16(cdf._cell[2], _f);
            __m128i _cmp3 = CDF<N>::NUM_SSE_UPDATES >= 4 ? _mm_sub_epi16(cdf._cell[3], _f) : _mm_setzero_si128();
            idx_mask |= _mm_movemask_epi8(_mm_packs_epi16(_cmp2, _cmp3)) << 16;
        }
    
        return clz32(idx_mask);
    
    #else
        int len = N, off = 0, mid;
        while ((mid = len >> 1) > 0) {
            if (cdf.cell[off + mid] <= freq) {
                off += mid;
            }
            len -= mid;
        }
    
        return off;
    #endif
    }
    The second condition (without SSE) is easier to understand: binary search to find the bucket given ranges. AVX would definitely be faster in the case N >= 16.

  6. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    Yes, that's what I'm talking about.
    But clz/tzcnt is a relatively slow scalar operation (3clk latency on intel), so I keep thinking that there might be a better way, like that minpos instruction.

    Btw, for two-part scanning it may be better to use popcnt instructions.

  7. #6
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    non-simd code should be faster with linear search up to 8-16 elements

    idx_mask = (idx_mask << 16) | _mm_movemask_epi8(_mm_packs_epi16(_cmp2, _cmp3)
    should be one cycle faster

    when NUM_SSE_UPDATES == 2, it should be better to use pand to clear bits. well, this requires use of popcnt as Eugene suggested

  8. #7
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    976
    Thanks
    266
    Thanked 350 Times in 221 Posts
    Quote Originally Posted by Shelwien View Post
    > functional requirements

    No branches, no loops, and some interesting new method to extract the information,
    like CRC trick in first post.
    If it could be somehow faster than movebits/tzcnt combination - would be the best, but unlikely.
    But I'm mostly interested in finding rare instructions that could be applied here.
    Intel SIMD is very asymmetric, so there're many unusual instructions.
    It is an arithmetic coder for 16 symbols? (or prefix/rice/ANS/something else...)

  9. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,593
    Thanks
    801
    Thanked 698 Times in 378 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    It is an arithmetic coder for 16 symbols? (or prefix/rice/ANS/something else...)
    https://encode.su/threads/2206-new-c...odle-1-45-quot

    http://cbloomrants.blogspot.de/2015/...odle-lzna.html

    https://fgiesen.wordpress.com/2015/0...distributions/

  10. Thanks:

    Gotty (30th December 2020)

  11. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    > idx_mask = (idx_mask << 16) | _mm_movemask_epi8(_mm_packs_epi16(_cmp2, _cmp3)

    Note that here you don't really need "|(idx_mask<<16)" - even with BSR its simpler to initialize index register with the right value,
    and tzcnt/popcnt don't have that problem at all.
    Also no need for packs() - its actually better if tzcnt returns *2 value, since you can directly index a 16-bit counter table with that.

  12. #10
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    976
    Thanks
    266
    Thanked 350 Times in 221 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    "LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding. 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases."

    This is still very confusing to me. I always considered all ANS to work with more than one bit at a time. I think it is better to write out the requirements of the work in clear text instead of thinking that everyone will be automatically on the same page with you.

  13. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    > LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding.

    There's more hype than proof for this.
    In theory, rANS consists of the same arithmetic operations as RC, just in different order.
    Currently ANS has more optimized SIMD implementations of interleaved coding,
    which are more efficient on modern out-of-order CPUs than old RCs.
    But there's no actual proof of ANS superiority.

    Same for tANS - its main benefit is not being covered by old AC patents,
    but state-machine AC implementations existed forever, and at FSM level
    the theory behind it doesn't really matter - CPU operations are the same.

    > 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases.

    Yes, but I'm using it with RC actually. I never said anything about ANS in original post.

    > I always considered all ANS to work with more than one bit at a time.

    No, the main difference is that ANS is LIFO, while RC is FIFO,
    arithmetics are similar (especially in binary case).

    Encode a symbol interval [symLow..symLow+symFreq) of [0..symTotal):
    rANS: state' = (state/symFreq)*symTotal+symLow+(state%symFreq);
    RC: low' = low+(range/=symTotal)*symLow; range*=symFreq;

    This can be illustrated with versions of decimal-to-binary conversion:
    s="12345";
    1) while( c=*s++ ) value = value*10+c-'0'; // rANS way, requires LIFO decoding
    2) r=100000; while( c=*s++ ) value += (r/=10)*(c-'0'); // RC way, FIFO decoding is ok

    That is, ANS is a complete equivalent of AC,
    coding of non-binary alphabets is just a speed optimization,
    which is also applicable to AC.

    > I think it is better to write out the requirements of the work in clear text
    > instead of thinking that everyone will be automatically on the same page with you.

    This thread was supposed to be about specific AVX2 implementation of a specific algorithm
    (finding a symbol interval by code value), which was mentioned on this forum many times,
    so getting complaints is pretty strange.

    But I can explain in detail if you have any questions.

  14. Thanks (2):

    Mike (30th December 2020),xinix (30th December 2020)

  15. #12
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    976
    Thanks
    266
    Thanked 350 Times in 221 Posts
    Quote Originally Posted by Shelwien View Post
    Yes, but I'm using it with RC actually. I never said anything about ANS in original post.
    When topic is complicated or advances state-of-the-art, it makes better discussion threads if you give an introduction, then explain what are the requirements or general ideas of your improvement over the state-of-the-art, what is your general planned approach, what things to avoid, etc. -- then add the code example, too.

    Most readers don't read your stand-alone code or try to reverse engineer the requirements from it.

  16. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    4,135
    Thanks
    320
    Thanked 1,397 Times in 802 Posts
    I agree in general, and usually post more introduction text, but this specific topic is in style of https://graphics.stanford.edu/~seander/bithacks.html
    and requires having experience with SIMD programming using intrinsics.
    Explaining it in plain english would be both time-consuming and boring for _all_ readers, so I'd get even less feedback.
    Also I already tried explaining it to you here: https://encode.su/threads/3542?p=67987&pp=1
    But I can try different wording.
    1. SIMD can be used to improve the speed of algorithm implementation.
    2. There's a class of so-called "horizontal" SIMD operations: https://rust-lang.github.io/packed_s...t-hor-ops.html
    3. There's a significant asymmetry in the design of x86 SIMD - there're many instructions like https://www.felixcloutier.com/x86/phminposuw
    that are only defined for some specific vector size and/or element types, and require additional arrangements before using them.
    4. In this thread I'm asking whether I missed some ways of horizontal scan implementation, maybe based on some other obscure intrinsics.
    For example, AVX512 has "reduce" intrinsics, or it may be possible to use something like https://www.felixcloutier.com/x86/pcmpistrm

Similar Threads

  1. Replies: 5
    Last Post: 15th January 2020, 10:50
  2. Nibble entropy encoding
    By JamesB in forum Data Compression
    Replies: 4
    Last Post: 19th October 2018, 18:28
  3. Dropbox DivANS (nibble)
    By Jarek in forum Data Compression
    Replies: 11
    Last Post: 17th September 2018, 05:35
  4. New ASPLOS paper on SIMD FSM's and Huffman decoding
    By Paul W. in forum Data Compression
    Replies: 0
    Last Post: 22nd April 2014, 05:26
  5. Nibble MMC matchfinder
    By Shelwien in forum Data Compression
    Replies: 4
    Last Post: 25th April 2011, 13:21

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •