
Thread: AVX-512 and interleaved rANS

  1. #1
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts

    AVX-512 and interleaved rANS

    I've been experimenting on an AVX-512 system (a 64-core x4 hyperthread Knight's Landing box). It's an interesting platform with a few nuances of its own.

    My rans_static4x16 code uses branchless renormalisation. In theory it's just a ?: operator, but we ended up going for asm cmov instructions because many compilers weren't generating them. It does 16-bit renormalisation and interleaves 4 rANS codecs together into the same compressed buffer. As such, on my i5 system this often runs at around 3.5 instructions per cycle, but doesn't use SIMD anywhere. This is due to the rANS state dependence on the previous loop iteration, and also the dependence on the same compressed data pointer shared between the 4 rANS states within an iteration.

    On my desktop this decodes around 660MB/s. On the KNL system it was more like 130MB/s. (For reference KNL fse was around 110MB/s and fsehuf around 220MB/s decode.)

    The core decoding loop is:
    Code:
        for (i=0; i < out_end; i+=4) {
            RansState m[4];
            m[0] = R[0] & mask;
            R[0] = sfreq[m[0]] * (R[0] >> TF_SHIFT) + sbase[m[0]];
    
            m[1] = R[1] & mask;
            R[1] = sfreq[m[1]] * (R[1] >> TF_SHIFT) + sbase[m[1]];
    
            m[2] = R[2] & mask;
            R[2] = sfreq[m[2]] * (R[2] >> TF_SHIFT) + sbase[m[2]];
    
            m[3] = R[3] & mask;
            out[i+0] = ssym[m[0]];
            out[i+1] = ssym[m[1]];
            out[i+2] = ssym[m[2]];
            out[i+3] = ssym[m[3]];
            R[3] = sfreq[m[3]] * (R[3] >> TF_SHIFT) + sbase[m[3]];
    
            RansDecRenorm(&R[0], &cp);
            RansDecRenorm(&R[1], &cp);
            RansDecRenorm(&R[2], &cp);
            RansDecRenorm(&R[3], &cp);
        }


    Switching from the branchless renormalisation to a simple branched structure ( if (x<RENORM) x=(x<<16)|*ptr++; ) bumped that up to 170MB/s instead. A huge leap on this platform (and a huge drop on some data sets on my desktop - hints welcomed on how to handle this!). This appears to be because the modified Silvermont core that KNL is based on has, I'm guessing, just 2 execution ports: typically it was only running 1.5 instructions per clock cycle, so the branch-miss penalty is much smaller than on the i5. This changes the tradeoff of more instructions vs more failed branches.
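
    For reference, the two renorm flavours boil down to roughly the following (just a sketch rather than the actual rans_static code, with illustrative function names; the stream is consumed as 16-bit words and RANS_BYTE_L is the lower renormalisation bound):
    Code:
        // branched: cheap when the branch predicts well, or when a miss is cheap (KNL)
        static inline void RansDecRenormBranched(RansState *r, uint16_t **pptr) {
            if (*r < RANS_BYTE_L)
                *r = (*r << 16) | *(*pptr)++;
        }

        // branchless: always loads the next word, then selects with ?: / cmov
        static inline void RansDecRenormBranchless(RansState *r, uint16_t **pptr) {
            RansState x = *r, y = (x << 16) | **pptr;
            *pptr += x < RANS_BYTE_L;       // advance only if we renormalised
            *r = x < RANS_BYTE_L ? y : x;
        }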

    I was thinking though about true SIMD. To do this, we need the separate rANS states to be completely independent. This means N states and N buffers, although 1 shared output buffer can avoid a scatter instruction. In code it's now:

    Code:
        for (i=0; i < out_end; i+=NX) {
    #pragma omp simd
            for (z = 0; z < NX; z++) {
              uint32_t S = s3[R[z] & mask];
              uint16_t f = S>>(TF_SHIFT+8), b = (S>>8) & mask; uint8_t s = S;
    
              R[z] = f * (R[z] >> TF_SHIFT) + b;
              out[i+z] = s;
    
              uint32_t y = (R[z] << 16) | *spN[z];
              spN[z] += R[z] < RANS_BYTE_L ? 1 : 0;
              R[z]    = R[z] < RANS_BYTE_L ? y : R[z];
            }
        }
    Unfortunately this is just much slower than the previous code until we make NX (a compile-time #define) high. NX=16 and above give similar results, presenting 512 bits' worth of rANS state. All the SIMD is done automatically by the compiler... well, by icc anyway.

    The s3[] array is the freq, base and symbol arrays (all indexed on R[z] & mask) squashed into one array; it avoids 3 separate emulated gathers, replacing them with 1 (plus the other common gather on spN[]). The spN[] array holds the per-codec pointer into the compressed data and replaces the single 'cp' of the previous algorithm. The omp stuff is there because icc simply doesn't generate good code without it - I think it changes assumptions on alignment amongst other things. Gcc 6.2 completely fails to vectorise this: it identifies the gather as being poor and then gives up vectorising entirely, while icc also identifies the gather as poor compared to separate loads and switches to an "emulated gather" using loads instead. It needs icc -fopenmp -xMIC-AVX512, as -march=native doesn't generate AVX-512 for some unknown reason (icc 16.0.2).
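
    For reference, the packing is essentially this (a sketch of the table construction rather than the actual code; it assumes the symbol fits in 8 bits, the base in TF_SHIFT bits and the freq in whatever is left of the 32-bit word):
    Code:
        // Build the combined s3[] lookup from the separate tables used in the
        // earlier loop. Field layout matches the decode above: freq in the top
        // bits, base in the middle, symbol in the low byte.
        uint32_t m;
        for (m = 0; m <= mask; m++)
            s3[m] = (sfreq[m] << (TF_SHIFT+8)) | (sbase[m] << 8) | ssym[m];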

    With NX=64 this code is now decoding at around 330MB/s, close to double the previous peak and faster even than fsehuf. However it's specialist hardware with an algorithm specifically written to exploit AVX512 registers.

    The disappointing result is that building for core-avx2 gets me code that runs slightly slower than the previous non-SIMD algorithm, as everything it gains from SIMD it loses by running fewer instructions per cycle; SSE4 is slower still. So it gives me a file format that doesn't work well across all chips. Still, it's an interesting development and worthy of more exploration. Has anyone else been having success with SIMD entropy encoding?

    Hopefully I'll identify why on an i5 the avx2 implementation of the above isn't any faster than the non-SIMD variant. It's clearly bottlenecking on something, perhaps memory fetches on the spN and s3 array loads. I'll look into adding prefetching. Fun fun, although I've only got a day or two left on this hardware.

    (Code at the moment is in a state of flux, but will be public soon I hope.)

    Edit: one slight modification to the above is to store *spN[z] in a temporary variable straight after the S=s3[...] load, interleaving the gathers and doing useful work while waiting on the first one to return its results. I think we can possibly interleave more here too, or maybe even keep an entire vector of previously gathered data. (A prefetch is still preferable, but that probably requires a complete rewrite to intrinsics.) It makes little difference on AVX-512, but the core-avx2 binary running on my i5 is now decoding at 739MB/s for low entropy data and 571MB/s for high entropy data, making it better or worse than before depending on the input.
    Last edited by JamesB; 22nd September 2016 at 01:48.

  2. Thanks (3):

    Cyan (22nd September 2016),Jarek (22nd September 2016),willvarfar (22nd September 2016)

  3. #2
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    If you get a gather for spN, that's really not ideal (context: I spent a year and a half working on the rendering stack for KNF/KNC before they got rebranded as Xeon Phi. I'm fairly intimately familiar with the ISA, though admittedly more the L1OM/K1OM versions than the AVX-512 that's in KNL). This should have exactly one gather in the core loop. The spN load absolutely, positively should be a VEXPANDD, not a gather, and you should have one interleaved stream, not N of them!

    Also, for AVX-512, you should not write code in a standard branchless way. AVX-512 has predication on everything. Using branchless constructs may often make things slower than just writing flow control that can get flattened to predication.

    The code should look something like this (using the new intrinsics, which I'm not all that familiar with since, as said, most of my work on this used earlier ISA iterations):

    Code:
      __m512i R; // RANS state vector
      __m512i maskv = _mm512_set1_epi32(mask); // constant (loaded once)
    
      // do 16 at once:
      __m512i masked = _mm512_and_epi32(R, maskv); // VPAND
      __m512i S = _mm512_i32gather_epi32(masked, s3, sizeof(*s3)); // VPGATHERDD
      __m512i f = _mm512_srli_epi32(S, TF_SHIFT+8); // VPSRLD
      __m512i b = _mm512_and_epi32(_mm512_srli_epi32(S, 8), maskv); // VPSRLD + VPAND
      // no need to extract s by hand
    
      R = _mm512_add_epi32(_mm512_mullo_epi32(_mm512_srli_epi32(R, TF_SHIFT), f), b); // VPSRLD, VPMULLD, VPADDD
      _mm_storeu_si128((__m128i *)out, _mm512_cvtepi32_epi8(S)); // mask low bytes of s, packs (down-conv), stores; 1 VPMOVDB with mem operand
    
      // renorm. this is the interesting part:
      __mmask16 renorm_mask = _mm512_cmplt_epu32_mask(R, _mm512_set1_epi32(RANS_BYTE_L)); // VPCMPUD
      __m512i renorm_words = _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *) sp)); // fetch next 16 words from sp; PMOVZXWD
      __m512i renorm_vals = _mm512_maskz_expand_epi32(renorm_mask, renorm_words); // move words to the right place; VEXPANDD
      R = _mm512_mask_slli_epi32(R, renorm_mask, R, 16); // shift only lanes that need renorm; VPSLLD
      R = _mm512_add_epi32(R, renorm_vals); // add in the newly read words; VPADDD
      sp += _mm_popcnt_u32(renorm_mask); // advance by however many words we actually read; MOV to get mask + POPCNT + LEA
    This uses a "regular" 16-stream interleaved layout, not N separate streams. Ought to be much better than reading from 16 (or even more!) buffers in parallel and eating all the cache headaches that causes.

  4. Thanks (4):

    Bulat Ziganshin (22nd September 2016),JamesB (22nd September 2016),Jarek (22nd September 2016),willvarfar (22nd September 2016)

  5. #3
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Also note that this is the exact same algorithm as the one already in the SSE decoder in ryg_rans (https://github.com/rygorous/ryg_rans...s_word_sse41.h), just 16-wide. (Which, as a historical note, is actually backwards - the SSE4.1 ryg_rans decoder was in fact derived by first writing it for KNF/L1OM and then translating the unsupported VLOADUNPACK, which is sort-of-but-not-quite equivalent to VEXPANDD, into a PSHUFB+table, and the gathers into manual loads).

    The biggest problem with going CPU-independent here is that in an interleaved stream, the largest targeted SIMD width affects the layout. That said, the nice part about rANS is that truly all of the per-lane state fits inside a single 32-bit word (R, here). So having, say, a 16-wide stream and an SSE4.1 impl that "macro-interleaves" four 4-wide decoders is reasonably practical. Especially since the 4x (SSE4.1) or 2x (AVX2) interleaving of instances really helps with bubbles on the wider out-of-order machines.

    A colleague found in his tests that the sweet spot for Haswell using AVX-2, with a pretty similar basic approach, was 32x interleaved (4x 8-vectors), and gave around 1GB/s on a 2.4GHz Haswell core (possibly turbo-ing up to 3-ish GHz though, don't remember, it's been a while). 8x 4-vectors (SSE4.1) is definitely enough to avoid bubbles on older processors, and 2x 16-vectors (KNL) should work better than a single 16-vector on KNL. 32 lanes is a lot though, and means some very significant flush overhead. You need to encode a lot of data for that to be worthwhile.

  6. #4
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Thanks Fabian.

    "renorm. this is the interesting part" ... Absolutely! Simple interleaving is what I had before and it basically stops the compiler from auto vectorising due to dependence between loop iterations. I like your alternative, although it's clear we're never going to get automatic generation of this from the compiler.

    It had one real gather (my s3 array) and one "emulated gather" which, IIRC, boiled down to loads rather than a vexpandd. So far I've been letting the compiler do all the vectorisation for me, for quick turnaround when testing.

    Regarding branches, I fully agree that it's better to use branching than branchless code on this platform, but my experience is that it is usually WORSE at branch prediction than my i5. The saving factor is that the simpler CPU design means a branch miss carries a pretty minor cost.

    Branch test: my branched "4x16" starting point on enwiki8:

    KNL: 17.2% miss rate, 1.09 ins/cyc, 127MB/s
    corei5: 16.5% miss rate, 1.7 ins/cyc, 470MB/s

    Branchless version:
    KNL: 0.5% miss rate, 1.62 ins/cyc, 134MB/s
    corei5: 0.2% miss rate, 3.33 ins/cyc, 664MB/s

    In that example branchless made very little difference on KNL. On a lower-entropy data set the branched version has a significant lead (about 30% faster) as the branch miss rate drops to around 10%, but on the i5 it's still a 10-15% slowdown.

  7. #5
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    The part about avoiding branches was just for code that you mean to get vectorized. The predication support in AVX-512 has a pretty drastic effect on the cost of (structured) flow control for vectorized code.

    Auto-vectorizing C/C++ is indeed relatively hard-pressed to use VEXPAND/VCOMPRESS efficiently, but there are better programming models for this stuff anyway. Something like ISPC (http://ispc.github.io/) is a much better match for this kind of programming, and allows targeting several instruction sets from the same source with relatively little fuss. Sadly, its support for non-32-bit types is lacking (no packed_load_active for uint16s, for example), otherwise it'd be pretty much perfect to write this in.

  8. #6
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    The light has just gone on regarding vpexpandd! I made the office all look quizzical by emitting a loud "Ahh hah!". This is exactly what we need in order to remove the dependency between lanes of SIMD on the encoded buffer pointer. Many thanks, I'm following in your coat-tails once again.

    I too experimented in the past with SSE SIMD and ended up with much the same shuf, blend & popcnt approach as your implementation; however, with only 4 lanes it wasn't enough to win out over the non-SIMD variants, and my 4x16 non-vectorised code was beating both your SSE SIMD implementation and mine for speed. I didn't get around to doing an avx2 order-0 implementation of it, which is a shame as logically it ought to be best. I did have an avx2 order-1 variant, but it was slower. That code still uses N distinct pointers and suffers from scatters, as by the nature of order-1 it has to work on disjoint locations.

  9. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Some further findings.

    1) Trying to SIMDify my original code shows the index lookups are done poorly. Instead of having 16 pointers with 16 indices starting at 0, it's better to have 1 pointer with 16 indices starting at different offsets. Obvious, but it's much more directly translatable to a gather.

    Code:
        for (i=0; i < out_end; i+=NX) {
    #pragma omp simd
            for (z = 0; z < NX; z++) {
              uint32_t S = s3[R[z] & mask];
              uint16_t f = S>>(TF_SHIFT+8), b = (S>>8) & mask; uint8_t s = S;
      
              R[z] = f * (R[z] >> TF_SHIFT) + b;
              out[i+z] = s;
    
              uint16_t Z = sp[idx[z]]; // Move to start of function for best avx2 speed.
              uint32_t y = (R[z] << 16) | Z;
              idx[z] += R[z] < RANS_BYTE_L ? 1 : 0;
              R[z]    = R[z] < RANS_BYTE_L ? y : R[z];
            }
        }
    On my core-i5 with icc -march=core-avx2 this decodes files at around 810-820MB/s; the same core-avx2 build runs at around 190MB/s on the KNL.
    On the KNL system, built with -xMIC-AVX512 instead, this runs at around 350MB/s.

    2) A hand-written intrinsics version of this code really does fly on AVX512 - the compiler was clearly generating something rather different.

    Code:
         __m512i maskv = _mm512_set1_epi32(mask); // set mask in all lanes                                             
        __m512i Rz = _mm512_load_epi32(R);
        __m512i indices = _mm512_load_epi32(idx); // offsets from cpN[0]                                              
    
        for (i=0; i < out_end; i+=16) {
          //for (z = 0; z < 16; z++) {                                                                                
          //uint32_t S = s3[R[z] & mask];                                                                             
          __m512i masked = _mm512_and_epi32(Rz, maskv);
          __m512i S = _mm512_i32gather_epi32(masked, s3, sizeof(*s3));
    
          //uint16_t f = S>>(TF_SHIFT+8), b = (S>>8) & mask;                                                          
          __m512i f = _mm512_srli_epi32(S, TF_SHIFT+8);
          __m512i b = _mm512_and_epi32(_mm512_srli_epi32(S, 8), maskv);
    
          //R[z] = f * (R[z] >> TF_SHIFT) + b;                                                                        
          Rz = _mm512_add_epi32(_mm512_mullo_epi32(_mm512_srli_epi32(Rz, TF_SHIFT), f), b);
    
          //out[i+z] = S;                                                                                             
          _mm_storeu_si128((__m128i *)(out+i), _mm512_cvtepi32_epi8(S));
    
          // V[z] = sp[idx[z]];                                                                                       
          __m512i V = _mm512_i32gather_epi32(indices, sp, sizeof(*sp));
          V = _mm512_and_epi32(V, _mm512_set1_epi32(0xffff)); // want 16-bit components only.                         
    
          //R[z]    = R[z] < RANS_BYTE_L ? (R[z] << 16) | V[z] : R[z];                                                
          // aka:  R[z] = R[z] < RANS_BYTE_L ? R[z]<<16  : R[z];                                                      
          //       R[z] = R[z] < RANS_BYTE_L ? R[z]|V[z] : R[z];                                                      
          __mmask16 renorm_mask = _mm512_cmplt_epu32_mask(Rz, _mm512_set1_epi32(RANS_BYTE_L));
          Rz = _mm512_mask_slli_epi32(Rz, renorm_mask, Rz, 16);
          Rz = _mm512_mask_or_epi32(Rz, renorm_mask, Rz, V);
    
          //idx[z] += R[z] < RANS_BYTE_L ? 1 : 0;                                                                     
          indices = _mm512_mask_add_epi32(indices, renorm_mask, indices, _mm512_set1_epi32(1));
        }
    This runs at around 550-560MB/s on KNL.

    The gains over auto-vectorisation are huge, and it makes me want to implement a manual avx2 decode loop too to test that.

    3) Inserting Fabian's gather replacement doesn't seem to be beneficial. I cannot yet explain why, as the logic is sound. I'm wondering if it is something else like cache utilisation or a slow popcnt implementation (perhaps pshufb will still win out).
    The code is more or less as Fabian posted, with a couple of very minor tweaks:

    Code:
        __m512i maskv = _mm512_set1_epi32(mask); // set mask in all lanes                                             
        __m512i R = _mm512_load_epi32(Rv);
    
        for (i=0; i < out_end; i+=16) {
          //for (z = 0; z < 16; z++) {                                                                                
          //uint32_t S = s3[R[z] & mask];                                                                             
          __m512i masked = _mm512_and_epi32(R, maskv);
          __m512i S = _mm512_i32gather_epi32(masked, s3, sizeof(*s3));
    
          //uint16_t f = S>>(TF_SHIFT+8), b = (S>>8) & mask;                                                          
          __m512i f = _mm512_srli_epi32(S, TF_SHIFT+8);
          __m512i b = _mm512_and_epi32(_mm512_srli_epi32(S, 8), maskv);
    
          //R[z] = f * (R[z] >> TF_SHIFT) + b;                                                                        
          R = _mm512_add_epi32(_mm512_mullo_epi32(_mm512_srli_epi32(R, TF_SHIFT), f), b);
    
          //out[i+z] = S;                                                                                             
          _mm_storeu_si128((__m128i *)(out+i), _mm512_cvtepi32_epi8(S));
    
          // renorm. this is the interesting part:                                                                    
          __mmask16 renorm_mask = _mm512_cmplt_epu32_mask(R, _mm512_set1_epi32(RANS_BYTE_L));
          __m512i renorm_words = _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *) sp)); // next 16 words    
          __m512i renorm_vals = _mm512_maskz_expand_epi32(renorm_mask, renorm_words); // select masked only           
          R = _mm512_mask_slli_epi32(R, renorm_mask, R, 16); // shift & add selected words                            
          R = _mm512_add_epi32(R, renorm_vals);
          sp += _mm_popcnt_u32(renorm_mask); // advance by however many words we actually read                        
        }
    This runs at around 520MB/s.

    I have yet to identify why it is slower. I'll experiment with perf, but I only have 1 day left with this system. I haven't yet tried multiplexing together two sets of AVX512 calls.

    4) (Edit) A little more speed can be gained by placing more work between the initial s3[] gather and the use of its result. The easiest way I've found of doing this is to move it to the end of the loop, before updating the indices or the sp pointer. Ongoing experimentation, but up to 570MB/s now.
    Last edited by JamesB; 23rd September 2016 at 00:14.

  10. #8
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    In the past, the "store" to "out+i" would've been a major problem, for aliasing reasons. (If out is a char* it can alias with anything, and I don't think the type cast helps, but I have no idea what ICC does with it.)

    The code I gave you was intentionally as close to a 1:1 translation of your original code as possible; if you want it to go faster, I would recommend:
    1. Moving the _mm256_loadu_si256 from sp to the very top of the loop (near the gather)
    2. Sinking the _mm_storeu_si128 of the output bytes all the way to the bottom (with no memory side effects left in the core loop body, this should allow much better scheduling)
    3. Definitely go 2x or even 4x "unrolled" (though still sort the memory accesses as mentioned above). You have plenty of regs (32 512-bit regs!) so that's not an issue, and IIRC the KNL went with the crazy 2-VPU design, which means it's insanely heavy on vector execution resources and needs a lot of latent ILP in a vector instruction stream to really get going. The KNF/KNC I'm used to had "only" one VPU, so they're less lopsided.
    4. Use at least two threads on top of everything else. Don't know if that changed in KNL, but at least on KNF/KNC, you can't hit more than half of the maximum throughput for a core with a single hardware thread, since each hardware thread takes 2 cycles to update its instruction pointer! (The instruction length determination was the critical path, and letting it take two cycles allowed clocking up the core an extra 10%; since the whole point of the design was to hide latencies with multiple hardware threads, always needing to have at least 2 runnable threads to achieve peak was not considered a problem). Mind, I know that "just thread this" is easier said than done, but it feels important to mention.
    5. popcnt *should* be fine; at least in KNF/KNC it got high priority (and its own special instruction, originally) simply because it is crucial for using COMPRESS/EXPAND (then VPACKSTORE/VLOADUNPACK) as queueing/dequeueing instructions, which is their primary design purpose. There is an extra MOV, though, to get the compare mask from a mask register (which is logically its own register file, although I believe it physically comes out of the regular GPR renaming pool) to a GPR.
    6. (This is the part where I softly sigh about VLOADUNPACK which combines masked load+upconv+expand being removed, and also where I sigh about the integer VMADD231PI being removed which would be just perfect for the RANS update, and also where I really miss VROTATEFIELDPI and VINSERTFIELDPI that already got lost in the re-encoding leading up to KNC and would've made the field extraction much nicer... alas.)

  11. Thanks:

    JamesB (22nd September 2016)

  12. #9
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    1 and 2 I'd already tried, in various guises (it's what I was doing in my "4) (edit)" comment above). It helped my split-sp version a little, but not the single-sp one, as that is still apparently bottlenecked somewhere else. I don't yet understand what, and perf isn't fine-grained enough to pinpoint the exact instructions.

    3: yes for sure, although as you say it gets pretty heavy on the flushing!
    A very quick hack, just doubling up everything, shows a leap from ~560MB/s to 830-910MB/s depending on entropy. Pretty solid out of a 1.3GHz CPU.

    4: this is part of a larger tool which handles threading itself - not so much for this particular algorithm, but anything via a thread worker pool. The best I managed was a whopping 244 working threads (24400% CPU utilisation) out of a possible 256, but generally it's pretty hard to keep it busy enough and my average was down near 100 to 150 threads in use. I wasn't aware of the 2 clock cycle thing - pretty mad! That explains why it's so hard to get high instructions per cycle then.

    6: Indeed - I went looking for the fused multiply-adds and was saddened once I realised they're floating-point only, barring some oddity with 52-bit integers.

    Ultimately though I doubt this is going anywhere anyway. I just have the machine for a short while and wanted to see what it's capable of! Ideally we want a single file format that can be read efficiently on older machines while also parallelising well on modern high-SIMD systems (i.e. from old 32-bit platforms all the way up to AVX-512). The lack of large SIMD mostly isn't a problem as it just boils down to old-school loop unrolling instead, but the core design needs a large enough unrolling to permit strong SIMD. This is also incompatible with our current formats, so it's really pipe-dream stuff at the moment. Getting quadruple the initial throughput is a solid start and there's still some room to squeeze more. Just a shame this code is at best only around 25% of my total CPU time. (The remainder being so spread out and varied there isn't much low-hanging fruit.)

    Still, a new toy is always fun!

    Many thanks for your insights, as usual.
    Last edited by JamesB; 23rd September 2016 at 00:48.

  13. #10
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    An interleaved AVX-512 decoder example, using neighbouring packed bytes in the decode buffer 'sp'. Decodes 32 rans states at a time.

    Code:
        __m512i maskv = _mm512_set1_epi32(mask); // set mask in all lanes                                             
        __m512i R1 = _mm512_load_epi32(&Rv[0]);
        __m512i R2 = _mm512_load_epi32(&Rv[16]);
    
        int offset=0;
        for (i=0; i < out_end; i+=32) {
          //for (z = 0; z < 16; z++) {                                                                                
          //uint32_t S = s3[R[z] & mask];                                                                             
          __m512i masked1 = _mm512_and_epi32(R1, maskv);
          __m512i masked2 = _mm512_and_epi32(R2, maskv);
          __m512i S1 = _mm512_i32gather_epi32(masked1, s3, sizeof(*s3));
          __m512i S2 = _mm512_i32gather_epi32(masked2, s3, sizeof(*s3));
      __m512i renorm_words1 = _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *) (sp+offset))); // next 16 words
    
          //uint16_t f = S>>(TF_SHIFT+8), b = (S>>8) & mask;                                                          
          __m512i f1 = _mm512_srli_epi32(S1, TF_SHIFT+8);
          __m512i f2 = _mm512_srli_epi32(S2, TF_SHIFT+8);
          __m512i b1 = _mm512_and_epi32(_mm512_srli_epi32(S1, 8), maskv);
          __m512i b2 = _mm512_and_epi32(_mm512_srli_epi32(S2, 8), maskv);
    
          //R[z] = f * (R[z] >> TF_SHIFT) + b;                                                                        
          R1 = _mm512_add_epi32(_mm512_mullo_epi32(_mm512_srli_epi32(R1, TF_SHIFT), f1), b1);
          R2 = _mm512_add_epi32(_mm512_mullo_epi32(_mm512_srli_epi32(R2, TF_SHIFT), f2), b2);
    
          // renorm. this is the interesting part:                                                                    
          __mmask16 renorm_mask1 = _mm512_cmplt_epu32_mask(R1, _mm512_set1_epi32(RANS_BYTE_L));
          __mmask16 renorm_mask2 = _mm512_cmplt_epu32_mask(R2, _mm512_set1_epi32(RANS_BYTE_L));
          offset += _mm_popcnt_u32(renorm_mask1); // advance by however many words we actually read                   
          __m512i renorm_words2 = _mm512_cvtepu16_epi32(_mm256_loadu_si256((const __m256i *) (sp+offset)));
    
          //out[i+z] = S;                                                                                             
          _mm_storeu_si128((__m128i *)(out+i),    _mm512_cvtepi32_epi8(S1));
          _mm_storeu_si128((__m128i *)(out+i+16), _mm512_cvtepi32_epi8(S2));
    
          __m512i renorm_vals1 = _mm512_maskz_expand_epi32(renorm_mask1, renorm_words1); // select masked only        
          __m512i renorm_vals2 = _mm512_maskz_expand_epi32(renorm_mask2, renorm_words2); // select masked only        
          R2 = _mm512_mask_slli_epi32(R2, renorm_mask2, R2, 16); // shift & add selected words                        
          R1 = _mm512_mask_slli_epi32(R1, renorm_mask1, R1, 16); // shift & add selected words                        
          R1 = _mm512_add_epi32(R1, renorm_vals1);
          R2 = _mm512_add_epi32(R2, renorm_vals2);
          offset += _mm_popcnt_u32(renorm_mask2); // advance by however many words we actually read                   
        }
    
        _mm512_store_epi32(&Rv[ 0], R1);
        _mm512_store_epi32(&Rv[16], R2);
    My first code to break 1000MB/s (sometimes)

    Code:
    [rans_static]$ icc -O3 -g -xMIC-AVX512 r16_32b.c -DTEST_MAIN -DNX=32
    [rans_static]$ ./a.out -t -o0 ~/jkb/q8;echo;./a.out -t -o0 ~/jkb/q40
     72.8 MB/s enc, 998.8 MB/s dec   73124567 bytes -> 16853680 bytes
     71.6 MB/s enc, 949.5 MB/s dec   73124567 bytes -> 16853680 bytes
     73.1 MB/s enc, 999.7 MB/s dec   73124567 bytes -> 16853680 bytes
     70.8 MB/s enc, 974.5 MB/s dec   73124567 bytes -> 16853680 bytes
     72.9 MB/s enc, 1002.3 MB/s dec  73124567 bytes -> 16853680 bytes
     70.1 MB/s enc, 1001.0 MB/s dec  73124567 bytes -> 16853680 bytes
     72.9 MB/s enc, 1000.7 MB/s dec  73124567 bytes -> 16853680 bytes
     72.9 MB/s enc, 1003.1 MB/s dec  73124567 bytes -> 16853680 bytes
     71.9 MB/s enc, 1001.9 MB/s dec  73124567 bytes -> 16853680 bytes
     73.1 MB/s enc, 1002.6 MB/s dec  73124567 bytes -> 16853680 bytes
    
     59.2 MB/s enc, 904.9 MB/s dec   94602182 bytes -> 53696061 bytes
     59.3 MB/s enc, 956.5 MB/s dec   94602182 bytes -> 53696061 bytes
     56.9 MB/s enc, 972.4 MB/s dec   94602182 bytes -> 53696061 bytes
     59.2 MB/s enc, 782.8 MB/s dec   94602182 bytes -> 53696061 bytes
     59.2 MB/s enc, 972.7 MB/s dec   94602182 bytes -> 53696061 bytes
     58.5 MB/s enc, 961.5 MB/s dec   94602182 bytes -> 53696061 bytes
     58.8 MB/s enc, 968.6 MB/s dec   94602182 bytes -> 53696061 bytes
     57.3 MB/s enc, 947.7 MB/s dec   94602182 bytes -> 53696061 bytes
     59.1 MB/s enc, 969.7 MB/s dec   94602182 bytes -> 53696061 bytes
     59.2 MB/s enc, 966.5 MB/s dec   94602182 bytes -> 53696061 bytes
    I need to figure out how easy this is to code in SSE, AVX, AVX2, etc. Possibly the lack of masked expands makes it slower than using a split up 'sp' pointer. For sure sharing one 'sp' makes the auto-vectorisation give up.

    PS. Yes, I haven't even looked at optimising the encoder yet.

  14. #11
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Still working on it. I have AVX2 as well as AVX512 implementations of the same algorithm now (32 interleaved rANS states) and an SSE one of a slightly different algorithm, pending adjustment to make it match. My plan is to use this on large blocks where SIMD parallelism is a good speed vs size tradeoff, and the old 4 way interleaving on smaller blocks (or maybe just FSEhuf).

    Code:
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t /tmp/enwik8
    190.1 MB/s enc, 1272.9 MB/s dec  100000000 bytes -> 63660018 bytes
    189.8 MB/s enc, 1270.2 MB/s dec  100000000 bytes -> 63660018 bytes
    190.2 MB/s enc, 1272.8 MB/s dec  100000000 bytes -> 63660018 bytes
    190.1 MB/s enc, 1273.1 MB/s dec  100000000 bytes -> 63660018 bytes
    190.0 MB/s enc, 1271.0 MB/s dec  100000000 bytes -> 63660018 bytes
    190.1 MB/s enc, 1272.3 MB/s dec  100000000 bytes -> 63660018 bytes
    190.1 MB/s enc, 1273.6 MB/s dec  100000000 bytes -> 63660018 bytes
    190.1 MB/s enc, 1272.8 MB/s dec  100000000 bytes -> 63660018 bytes
    190.2 MB/s enc, 1272.7 MB/s dec  100000000 bytes -> 63660018 bytes
    190.0 MB/s enc, 1272.1 MB/s dec  100000000 bytes -> 63660018 bytes
    I may be able to go faster as I haven't optimised the placement of my statements yet; moving things around a bit may help. The AVX2 code uses a "standard" single decode pointer, with a load into an m128, cvtepu16_epi32 and permutevar8x32 to shuffle those values into the correct places for blending (rough sketch below). This is on a 3.2GHz i5-4570. Clearly the encoder needs work, but I haven't even touched it yet. Similarly the order-1 encoder/decoder hasn't been touched either, but I predict similar gains, albeit not a similar final speed.
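
    For the curious, the emulated expand is roughly along these lines - an untested sketch rather than the real code, where perm_tab is a hypothetical 256-entry table of lane permutations built up front and RANS_BYTE_L is assumed to be the usual 1<<16 lower bound:
    Code:
        #include <immintrin.h>
        #include <stdint.h>

        #ifndef RANS_BYTE_L
        #define RANS_BYTE_L (1u << 16)   /* assumed lower renormalisation bound */
        #endif

        // For each 8-bit renorm mask, a lane permutation scattering the first
        // popcount(mask) loaded words to the lanes that need them, in order.
        static __m256i perm_tab[256];

        static void build_perm_tab(void) {
            int m, i;
            for (m = 0; m < 256; m++) {
                int idx[8], k = 0;
                for (i = 0; i < 8; i++)
                    idx[i] = (m >> i) & 1 ? k++ : 0;  // rank of set bits; 0 = don't care
                perm_tab[m] = _mm256_loadu_si256((const __m256i *)idx);
            }
        }

        // One renorm step for 8 states, reading 16-bit words from one shared stream.
        static inline __m256i renorm_avx2(__m256i Rv, uint16_t **psp) {
            const __m256i L    = _mm256_set1_epi32(RANS_BYTE_L);
            const __m256i bias = _mm256_set1_epi32((int)0x80000000);

            // Unsigned R < RANS_BYTE_L done as a signed compare after flipping sign bits.
            __m256i cond = _mm256_cmpgt_epi32(_mm256_xor_si256(L,  bias),
                                              _mm256_xor_si256(Rv, bias));
            unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(cond));

            // Next 8 words from the shared pointer, widened to 32 bits.
            __m256i words = _mm256_cvtepu16_epi32(_mm_loadu_si128((const __m128i *)*psp));
            // Emulated vpexpandd: shuffle the words into the renormalising lanes.
            __m256i vals  = _mm256_permutevar8x32_epi32(words, perm_tab[mask]);

            __m256i renormed = _mm256_or_si256(_mm256_slli_epi32(Rv, 16), vals);
            *psp += _mm_popcnt_u32(mask);
            return _mm256_blendv_epi8(Rv, renormed, cond); // keep lanes that didn't renorm
        }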

    I'll push them all to my github account once I've tidied them up some more and squeezed the last ounce out, but thanks for the hints Fabian - it's still paying off.

    Edit: Fsehuf on the same file is encoding at 620MB/s and decoding at 860MB/s. TurboANX with TurboBench -V7 -X1M (same size as my program) is encoding at 420MB/s and decoding at 610MB/s. That's probably realistic given Dnd's own benchmarks are 774MB/s decode, on a machine with 40% higher clock speed. The penalty is a large overhead per buffer (an extra 120 bytes or so), hence favouring large block sizes.

  15. Thanks:

    Jarek (27th September 2016)

  16. #12
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Cool! The i5-4570 can Turbo Boost up to 3.6GHz; is this a single-threaded test (which might actually boost that high if there's not much other load on the system), or is Turbo Boost disabled? Curious whether this is at the nominal 3.2GHz or at a higher clock.

    Either way, I think that's a new record for rANS. My colleague's 32-wide rANS using AVX2 on a (lower-clocked) Haswell laptop peaked at just over 1GB/s in our tests, I believe. You should get a pretty big boost on Skylake and later CPUs, since the gather got some more love in that uArch revision. In any case, yours works out to sub-3 cycles/byte, which is very respectable and a good deal faster than even most heavily optimized Huffman coders. :P

    As for ~1000 MB/s on the 1.3GHz KNL, that works out to an even 1.3 cycles/byte. With a 32-wide impl, that comes out at about 41.5 cycles for one iteration through the main loop, and there should be about 32-33 instructions in that loop. If KNL still has the "threads may only issue every 2 cycles" limitation, that's a really good result. If not, it depends on how fast the gathers complete, which depends on how many unique cache lines you hit per gather (gathering takes at least 1 cycle per unique cache line hit, more if there are misses), which in turn depends on your frequency distribution. Either way, it's definitely in the right ballpark.

    How are you doing the final flush? You can reduce (but not eliminate) the flush overhead with wide-interleaved rANS by having the encoders flush into each other. https://fgiesen.wordpress.com/2015/1...s-in-practice/ has the basic idea ("Tying the knot"). It boils down to a standard reduction tree: reduce streams from 32 to 16 (typically either by pairing each stream with its immediate neighbor, or with the one 16 streams over), then 16 to 8, 8 to 4, and so forth.

  17. #13
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    This is single threaded and I don't think it has Turbo Boost. I don't have root access on my work machine, so I can't look in the proper places. However, using hwloc-bind to peg it to one CPU and then regularly grepping /proc/cpuinfo for MHz shows it pegged at a steady 3201, so I don't believe it is enabled. (It drops back to 800 once the job finishes, so some level of auto-scaling is enabled though.) I still have some tweaks I want to do, so I'm hoping there is a little more room to be had. I won't know until I experiment with moving things around a little to change the interleaving. Right now it's a dumb 2-way interleave followed by a 2nd 2-way interleave, as a full 4-way AVX2 interleave runs out of registers and starts to spill things to memory. More careful selective interleaving may work though.

    As for the final flush, I haven't looked at that yet so it's still the trivial approach. I like your ideas on this and thanks for the reminder again. I will read your paper once more too.

    I also now have some intriguing memory requirements which may be fun to deal with. The read-ahead of N states' worth of renormalisation means we need 2N bytes of extra memory beyond the end of the buffer. We can either just allocate +2N when we do the reading and make it a requirement on the calling function to help us out, or we bail out one iteration early and do a slower non-SIMD decode for the final set. It's all polishing though, and I haven't thought much about this yet while there are much bigger issues to tackle - like my order-1 rANS codec and the encoding side for both order 0 and 1. Logically the same speedups are available, although order-1 is a bit trickier with regards to memory access as it has to be disjoint by design; each rANS state is tracking a single non-interleaved stretch of the buffer.

    Regarding KNL - I've no idea if they kept the 2 cycles per instruction decode. The number of measured instructions per cycle for my original non-SIMD code was around half on this box compared to my desktop, but that doesn't necessarily mean much. I assumed it meant it just had fewer execution ports available, but it may also have meant slower instructions. I'd need to write some simple assembly loops to time it, I think, but I no longer have access to the system.

    Anyway, more to come.

  18. #14
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    Extra memory: you also need to handle this in case data is ever corrupted, else it's a potential buffer overflow vector with the right data (always bad). However, you can handle this with a fairly low amount of extra work by simply copying the tail end of the input buffer to a separate (on-stack) buffer with room for 4N words (zero-padded after the end of the "real" data) once you get close (details here: https://fgiesen.wordpress.com/2016/0...decompressors/). The only extra work in the hot path of the main loop is the "if (input_cursor >= input_mark)" check which is just a compare and a (well-predicted) branch, and it means you don't need a separate fiddly "careful" decoder. It's not a huge deal either way but it gives pretty good bang for the buck.
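
    In outline, the guard is something like this (a sketch with illustrative names and sizes, not the actual BitKnit code):
    Code:
        #include <stdint.h>
        #include <string.h>

        // Conservative bound on how many input bytes one iteration of the
        // decode loop can consume (illustrative value).
        enum { MAX_ITER_BYTES = 20 };

        typedef struct {
            const uint8_t *cur;    // current input cursor
            const uint8_t *mark;   // "getting close to the end" checkpoint
            const uint8_t *end;    // one past the last real input byte
            uint8_t scratch[4 * MAX_ITER_BYTES];
            int in_scratch;
        } InputGuard;

        static void guard_init(InputGuard *g, const uint8_t *in, size_t n) {
            g->cur  = in;
            g->end  = in + n;
            g->mark = n > MAX_ITER_BYTES ? g->end - MAX_ITER_BYTES : in;
            g->in_scratch = 0;
        }

        // Called once per main-loop iteration; in the common case it costs one
        // compare and a well-predicted branch.
        static void guard_check(InputGuard *g) {
            if (!g->in_scratch && g->cur >= g->mark) {
                size_t left = (size_t)(g->end - g->cur);
                memset(g->scratch, 0, sizeof(g->scratch)); // zero padding after real data
                memcpy(g->scratch, g->cur, left);
                g->cur = g->scratch;
                g->end = g->scratch + left; // consuming beyond this means a malformed stream
                g->in_scratch = 1;
            }
        }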

    BitKnit does all its input buffer over-read protection that way (its bitstream was designed to be suitable for it): there's a single copy of the main decoder loop that uses branchless rANS renorm and can pause and resume at the top of the decoder main loop (but not anywhere else). I found that to be much less hassle (and less of a speed hit to boot!) than turning the entire decoder into a state machine (e.g. zlib-style). The price you pay is that the "pause" condition is conservative: say your upper bound is that one iteration of the main loop will never consume more than 20 bytes. But at some point there really are fewer than 20 bytes left in the input stream! So the outer loop must add padding bytes (we use zeros) at the end to "drain the pipe". (You monitor if any of those padding bytes actually were consumed, and return a "malformed bitstream" error if so). The problem is knowing when to insert those 20 bytes. The version of BitKnit that ships in Granny plays cute tricks with the framing to shave a few bytes per stream (there can be several in a Granny file), and as a result makes this really nasty. I would not do that again. But if you're willing to spend a few bytes to flag the size of the compressed bitstream (or of the last flush chunk, anyway), this all goes away, and things stay nice and simple.

    You can use the exact same trick to transition between two input buffers (say again the zlib-style buffer API, where you provide input in variable-sized chunks). The idea is that most of the time, you run straight off the input buffer. But as you get close to the end (say within 20 bytes again), you copy the remaining tail bytes to a scratch buffer, and bail, letting the producer feed in some more compressed data. Then to resume, you add 20 bytes from the start of your new buffer to your scratch buffer, and decode until your input pointer (now in the scratch buffer!) points past the last byte from the previous source buffer. After that, you can resume decoding from the new input buffer.

    The overall process is one of decoding large blocks of data straight from the provided input buffers, with short runs of decoding from a tiny scratch buffer (set up to cover the transition from the previous buffer to the next one) between. The padding at the end is basically just a special case of this, where you add an extra 20 bytes (or whatever) after every valid stream. Compared to a full state machine, this method is a bit "sloppy"; the decoder may decide to stop even when it would not actually over-run the buffer. And you really do need the extra information in the framing - an in-band "end of stream" marker doesn't work here, since you need to know in advance that the stream is about to end. Finally, the buffer management logic in the outer loop is a bit tricky. But the nice part about it is that it really is all in the outer loop: no crudding up the inner loop with state machine logic, no need for a separate "fast" decoder that runs as long as you know you're not close to the end, and the outer logic can be written once and can be tested independent of the actual bitstream decoder itself (which makes it easier to get both right).

    This part has been shipping in Granny for over a year (which uses it to feed compressed data in incrementally during file decompression, since we want to only use a small IO buffer if possible) and works fine. The self-inflicted unnecessary complexity of my trying-too-hard framing aside, I'm quite happy with it.

    All that said, it's not always practical. For example, it doesn't work well with non-interleaved aggressively multi-stream layouts: checking one stream per main loop iteration is fine, but if you have to check sixteen, not so much, and the logistics become quite awkward. But even when you can't use it directly, it's still a useful building block.

  19. Thanks (3):

    JamesB (27th September 2016),m^3 (28th September 2016),Turtle (28th September 2016)

  20. #15
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Is there anything you haven't already posted about?

    The non-interleaved rANS stream, at least in my implementation, had less of an issue than the interleaved "classic" case. Specifically, if I have 32 decode pointers then 31 of the 32 are somewhere in the middle of my single buffer (passed in) and only the last one runs the risk of overrunning, and even then only by one uint16_t. Having all 32 interleaved into the same decode pointer means fetching words for up to 32 potentially underflowing states, and therefore reading beyond the buffer end by a larger amount.

    However I was planning on solving it without any additional checks in the main loop, simply by stopping one round early to guarantee it cannot happen. I don't decode until I see a specific EOF symbol; rather, the stream starts with the size of the decoded data (so I can also check up front that the supplied output buffer is large enough, as writing beyond the end of a buffer is much more of a security risk than simply reading beyond it). I also have to deal with the case where the buffer isn't an exact multiple of 32, so that tail loop handling the remainder could just handle 32+remainder instead. I haven't decided yet whether that is better than simply having a call with an up-front requirement that the input buffer is rounded to a multiple of 32 (or 64... whatever it turns out to be), ensuring that any read overflow is valid and failure to do so is a programming error rather than an input data error.

    PS: New top machine. Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz. It is at 2.6GHz too when running, according to /proc/cpuinfo.
    Code:
    @ hm-02[compress.../rans_sta...]; ./a.out -t ~/scratch/data/enwik9
    203.0 MB/s enc, 1414.8 MB/s dec     1000000000 bytes -> 644262934 bytes
    203.1 MB/s enc, 1383.6 MB/s dec     1000000000 bytes -> 644262934 bytes
    201.5 MB/s enc, 1416.7 MB/s dec     1000000000 bytes -> 644262934 bytes
    201.6 MB/s enc, 1416.5 MB/s dec     1000000000 bytes -> 644262934 bytes
    201.7 MB/s enc, 1390.1 MB/s dec     1000000000 bytes -> 644262934 bytes
    That works out to around 1.9 cycles per byte. The system is Broadwell, so it has better availability of various execution units compared to my Sandy-Bridge desktop.
    Last edited by JamesB; 27th September 2016 at 16:30.

  21. #16
    Member
    Join Date
    Dec 2015
    Location
    US
    Posts
    57
    Thanks
    2
    Thanked 112 Times in 36 Posts
    I think I've now linked to my entire backlog of writings about rANS; so this is it.

    Having multiple non-interleaved streams means you're unlikely to hit the nasty cases on non-corrupted data, but you still need to handle them if you want to be robust. For an example of what can go wrong, consider a 2-symbol order-0 alphabet with P(0)=1/4096 and P(1)=4095/4096 (entropy: 0.004 bits/symbol!), and furthermore say that you have 4 streams. Now say the data for the first stream is corrupted and some 50 bytes in the middle got overwritten with zeroes, meaning you'll decode a bunch of 0 symbols, consuming 12 bits each time, and advance the stream pointer well beyond where it would have gone on the non-corrupted data - very close to 50 bytes further, since the entropy is so low. That's 400 bits which, at our entropy, corresponds to about 100000 symbols' worth, and might well be larger than the sizes of the other 3 streams summed together if your block isn't that long!

    With valid data (and a sequential stream layout), the decode pointers will always be nicely ordered: ptr[0] <= ptr[1] <= ... <= ptr[Nstreams-1], and checking just the last one is sufficient (since we don't really care if the other streams happen to read garbage as long as they don't read out of bounds, we can detect that it happened later and report data corruption then). But in particular with highly skewed data, even small corruptions can make one of the pointers overtake another, at which point all bets are off. There are also issues when the streams have wildly different sizes, which can happen if your interleaving pattern happens to line up with some other structure in your data: for another cartoon example, if you're order-0 coding a bunch of small-ish 32-bit ints with 4x interleaving, the streams containing the MSBs will generally be a good deal smaller than the LSB streams.

    Having multiple sequential streams does give you a fair amount of slack for the extra streams - for example, it's fine to only test each stream pointer against a single "getting close to the end" checkpoint for the last stream, and not bother about checking against the precise bounds for each individual stream until the main loop is done - but you really do need to monitor all of them.
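
    Concretely, the cheap in-loop check can be as simple as this (illustrative sketch, not from any particular codebase):
    Code:
        #include <stdint.h>

        // One conservative check covering all stream cursors; 'checkpoint' is a
        // single mark a safe distance before the end of the whole compressed
        // block. Exact per-stream bounds are only verified after the main loop.
        static int any_stream_near_end(const uint8_t *const ptr[], int nstreams,
                                       const uint8_t *checkpoint) {
            int near_end = 0;
            for (int z = 0; z < nstreams; z++)
                near_end |= (ptr[z] >= checkpoint);
            return near_end; // if set, drop to the careful tail path
        }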

  22. Thanks (2):

    JamesB (28th September 2016),SolidComp (1st October 2016)

  23. #17
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Still a work in progress, with zero effort on the encoding side yet. I need to merge all these variants together into a single file with auto-detection of CPU type too and a more efficient non-SIMD version.

    However, see the attached source + Ubuntu Trusty Linux binary; AVX2 only!

    Code:
    @ deskpro107386[compress.../rans_sta...]; ./r32x16b_avx2 -o0 -t /tmp/enwik8 
    197.3 MB/s enc, 1288.6 MB/s dec  100000000 bytes -> 63660018 bytes
    197.2 MB/s enc, 1288.2 MB/s dec  100000000 bytes -> 63660018 bytes
    197.4 MB/s enc, 1283.6 MB/s dec  100000000 bytes -> 63660018 bytes
    195.9 MB/s enc, 1287.6 MB/s dec  100000000 bytes -> 63660018 bytes
    197.1 MB/s enc, 1288.3 MB/s dec  100000000 bytes -> 63660018 bytes
    196.8 MB/s enc, 1289.0 MB/s dec  100000000 bytes -> 63660018 bytes
    197.1 MB/s enc, 1287.9 MB/s dec  100000000 bytes -> 63660018 bytes
    197.1 MB/s enc, 1289.7 MB/s dec  100000000 bytes -> 63660018 bytes
    197.3 MB/s enc, 1288.4 MB/s dec  100000000 bytes -> 63660018 bytes
    197.4 MB/s enc, 1291.9 MB/s dec  100000000 bytes -> 63660018 bytes
    
    @ deskpro107386[compress.../rans_sta...]; ./r32x16b_avx2 -o1 -t /tmp/enwik8 
    106.8 MB/s enc, 676.7 MB/s dec  100000000 bytes -> 49873915 bytes
    107.5 MB/s enc, 679.2 MB/s dec  100000000 bytes -> 49873915 bytes
    107.3 MB/s enc, 676.7 MB/s dec  100000000 bytes -> 49873915 bytes
    107.3 MB/s enc, 677.1 MB/s dec  100000000 bytes -> 49873915 bytes
    106.9 MB/s enc, 668.8 MB/s dec  100000000 bytes -> 49873915 bytes
    106.8 MB/s enc, 677.0 MB/s dec  100000000 bytes -> 49873915 bytes
    107.0 MB/s enc, 675.9 MB/s dec  100000000 bytes -> 49873915 bytes
    107.1 MB/s enc, 676.4 MB/s dec  100000000 bytes -> 49873915 bytes
    107.3 MB/s enc, 676.8 MB/s dec  100000000 bytes -> 49873915 bytes
    107.2 MB/s enc, 679.6 MB/s dec  100000000 bytes -> 49873915 bytes
    Encode speeds are pretty rubbish - well below my other code. I'll get to it sometime...
    Source is a mix of mine and Fabian's original rANS implementation.

    PS. No, I won't build a Windows binary, sorry!
    Attached Files

  24. Thanks:

    Jarek (29th September 2016)

  25. #18
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    It is fast. Tested on an i7-6500U (2.5GHz, 3.1GHz boost):
    Code:
    174.4 MB/s enc, 1514.6 MB/s dec     100000000 bytes -> 63660018 bytes
    174.2 MB/s enc, 1507.7 MB/s dec     100000000 bytes -> 63660018 bytes
    172.9 MB/s enc, 1518.0 MB/s dec     100000000 bytes -> 63660018 bytes
    174.5 MB/s enc, 1519.7 MB/s dec     100000000 bytes -> 63660018 bytes
    174.1 MB/s enc, 1517.3 MB/s dec     100000000 bytes -> 63660018 bytes
    167.8 MB/s enc, 1349.9 MB/s dec     100000000 bytes -> 63660018 bytes
    154.4 MB/s enc, 1466.3 MB/s dec     100000000 bytes -> 63660018 bytes
    174.0 MB/s enc, 1513.3 MB/s dec     100000000 bytes -> 63660018 bytes
    174.5 MB/s enc, 1517.5 MB/s dec     100000000 bytes -> 63660018 bytes
    174.2 MB/s enc, 1517.6 MB/s dec     100000000 bytes -> 63660018 bytes
    for comparison
    Code:
        62957995    63.0     575.89     829.02   TurboHF         enwik8
        63202025    63.2     336.05     449.09   fse             enwik8
        63420890    63.4     555.21     763.74   fsehuf          enwik8
        63287917    63.3     244.14     615.05   rans_static16   enwik8

  26. Thanks:

    JamesB (30th September 2016)

  27. #19
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Thanks for testing.

    So the ratio is suffering a little, probably due to the overhead of all those states. I haven't yet had time to "tie the knot" as Fabian describes it.

    How is order-1 doing? Use -t -o1 for that. It's a very rough approximation as it decreases the probability precision to just 9 bits! However, that is a deliberate tradeoff to get more compression than order-0 while not exploding memory usage. With fewer bits, cache misses are less likely, particularly on low-entropy data (fortunately the norm for my project).

    It's possible there are subtle differences between the testing frameworks of TurboBench and my code too, so for a proper like-for-like comparison I'd ideally make a pull request to add it in. I'll do that once I get time to polish it more, adding SSE, non-SSE, encoder speedups, etc. (Not that soon, alas.)

  28. #20
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    232
    Thanked 235 Times in 145 Posts
    These speeds are crazy!
    3 years ago the state of the art for accurate entropy decoding was <100MB/s ...

    Regarding storing the states, which seems to be the reason for the lower compression ratio here, it can be compensated for by storing some information in the initial state:
    - simpler way: instead of starting with the lowest possible state, e.g. 2^16, start with 2^16 + "first 16 bits of the stream",
    - better way: start with state 1 and use e.g. the rANS encoding rules until the state is above 2^16 (without renormalization).
    Additionally, a fraction of a bit can be saved by using an entropy coder for the final state: Pr(x) is proportional to 1/x.

  29. Thanks:

    algorithm (1st October 2016)

  30. #21
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    order 1
    Code:
    105.3 MB/s enc, 750.3 MB/s dec     100000000 bytes -> 49873915 bytes
    105.4 MB/s enc, 741.2 MB/s dec     100000000 bytes -> 49873915 bytes
    105.2 MB/s enc, 745.7 MB/s dec     100000000 bytes -> 49873915 bytes
    105.3 MB/s enc, 753.6 MB/s dec     100000000 bytes -> 49873915 bytes
    105.2 MB/s enc, 748.7 MB/s dec     100000000 bytes -> 49873915 bytes
    105.3 MB/s enc, 743.3 MB/s dec     100000000 bytes -> 49873915 bytes
    105.1 MB/s enc, 744.7 MB/s dec     100000000 bytes -> 49873915 bytes
    105.4 MB/s enc, 749.6 MB/s dec     100000000 bytes -> 49873915 bytes
    105.3 MB/s enc, 745.2 MB/s dec     100000000 bytes -> 49873915 bytes
    105.4 MB/s enc, 741.1 MB/s dec     100000000 bytes -> 49873915 bytes
    I think order-1 is not well suited to static frequencies. Maybe an adaptive or semi-adaptive entropy coder would behave better.
    Also, I think it is not necessary to use the whole previous byte as context. A hash of the byte down to, for example, 4 bits may be a better choice (maybe a simple xor).
    A question: do the getopt functions compile for Windows?

  31. Thanks:

    JamesB (1st October 2016)

  32. #22
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Order-1 isn't generally wise for static frequencies, but the data I use it on happens to fit this neatly - DNA assembly quality values. They exhibit a reasonable order-1 correlation (and a weak order-2), but are otherwise pretty uniform across the entire data set. Hence there is no real need for adjusting and tracking different statistics in different regions of the file. Additionally we require random access, so multiple distinct blocks of data fit this well. It all adds up to making static frequencies a reasonable solution versus a slower, more adaptive approach. I freely admit this is a bit of a niche case though!

    I realise now why the ratio on enwik8 isn't as good as fse etc - it'll simply be block size. My demo has a relatively large block size (it's a #define so it can be adjusted via compiler params on the command line), partly for speed but also so that storing the order-1 frequencies doesn't end up dominating the file size. This also tends to be why the rans_static16_o1 sizes reported on the TurboBench site aren't that great - the block size used there is too small for that algorithm to perform well.

    That said, even on enwik8 it was showing faster decodes in order-1 than FSE order-0. Ignoring the huge hole of rubbish encoding performance (currently), I wonder how well ZStd would work with large block sizes and an order-1 encoding of literals instead of order-0. Of course this is AVX2-only at present (and AVX-512, but that's still pretty exotic), so more work to be done yet.

  33. #23
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Quote Originally Posted by JamesB View Post
    Clearly encoder needs work, but I haven't even touched it yet. Similarly the order-1 encoder/decoder hasn't been touched either, but I predict similar gains, albeit not similar final speed.
    Ugh how wrong that statement was! :-/

    I appear to be hitting differences between compilers somewhat too. My latest code:

    Code:
    icc-15.0:
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o0 /var/tmp/enwik8
    389.7 MB/s enc, 1252.1 MB/s dec     100000000 bytes -> 63660018 bytes
    389.0 MB/s enc, 1252.5 MB/s dec     100000000 bytes -> 63660018 bytes
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o1 /var/tmp/enwik8
    210.6 MB/s enc, 750.1 MB/s dec     100000000 bytes -> 49873915 bytes
    210.2 MB/s enc, 751.9 MB/s dec     100000000 bytes -> 49873915 bytes
    
    gcc-4.8
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o0 /var/tmp/enwik8
    437.8 MB/s enc, 1329.6 MB/s dec     100000000 bytes -> 63660018 bytes
    435.4 MB/s enc, 1323.0 MB/s dec     100000000 bytes -> 63660018 bytes
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o1 /var/tmp/enwik8
    180.3 MB/s enc, 674.5 MB/s dec     100000000 bytes -> 49873915 bytes
    176.8 MB/s enc, 676.2 MB/s dec     100000000 bytes -> 49873915 bytes
    
    gcc-6.1
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o0 /var/tmp/enwik8
    426.0 MB/s enc, 1255.5 MB/s dec     100000000 bytes -> 63660018 bytes
    424.0 MB/s enc, 1257.3 MB/s dec     100000000 bytes -> 63660018 bytes
    @ deskpro107386[compress.../rans_sta...]; ./a.out -t -o1 /var/tmp/enwik8
    189.1 MB/s enc, 663.2 MB/s dec     100000000 bytes -> 49873915 bytes
    188.2 MB/s enc, 662.3 MB/s dec     100000000 bytes -> 49873915 bytes
    So icc favours order-1 and gcc favours order-0, with older gcc being better.
    If I run these binaries on a Broadwell machine then the icc order-0 decoder is MUCH faster than the gcc one (almost double). This is the same gcc-built binary on Haswell vs Broadwell. It looks to be choking on a permute, so maybe I need to do some reordering of my code. :/

    Anyway, the summary is that a SIMD encoder is tricky due to the large number of things we need to gather / load / set with rANS; e.g. all those reciprocals for emulating the modulo step. This is where tANS would have a huge win. I really should have started down that road in 2014 when I first started investigating ANS, as it's clearly the more appropriate method for my work! I'm sure tANS would vectorise very well given enough simultaneous lanes to operate on.
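    For reference, a rough sketch of the reciprocal trick in question, loosely in the style of ryg_rans but with assumed struct and field names (the computation of rcp_freq/rcp_shift so that q is exactly x/freq is omitted). The pain point for SIMD is that every lane must gather its own set of these per-symbol constants:
    Code:
    #include <stdint.h>

    typedef struct {
        uint32_t rcp_freq;   /* fixed-point reciprocal of freq */
        uint32_t rcp_shift;  /* shift paired with the reciprocal */
        uint32_t freq;       /* symbol frequency */
        uint32_t cfreq;      /* cumulative frequency */
    } EncSymbol;

    /* One rANS encode step with the division replaced by a mulhi-style
     * reciprocal multiply: x' = x + cfreq + (x/freq) * (M - freq). */
    static inline uint32_t rans_enc_step(uint32_t x, const EncSymbol *s,
                                         uint32_t scale_bits) {
        uint32_t q = (uint32_t)(((uint64_t)x * s->rcp_freq) >> 32) >> s->rcp_shift;
        return x + s->cfreq + q * ((1u << scale_bits) - s->freq);
    }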

  34. Thanks (2):

    Cyan (16th October 2016),Jarek (15th October 2016)

  35. #24
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    232
    Thanked 235 Times in 145 Posts
    It's funny that while you say encoding is much more costly, there is finally a Daala paper with some throughputs ( https://arxiv.org/pdf/1610.02488.pdf ) ... and it says the opposite:



    This is for a static situation with the alphabet size limited to 16 - in this setting your 400MB/s encoding and 1300MB/s decoding, multiplied by 4 bits/symbol, translate to ~1600Mbps encoding and ~5200Mbps decoding.
    VP9 was using a binary alphabet; Daala EC is Moffat's approximation of range coding - the ratio loss is up to 3% for a binary alphabet.

    This paper also still insists that one cannot do adaptive rANS: "However, unlike rANS, it is possible to modify the implementation to encode with estimated probabilities rather than optimal probabilities." ...

    update: Timothy has cleared the adaptive rANS statement: https://groups.google.com/a/webmproj...Y/KLm822GIAQAJ
    Last edited by Jarek; 16th October 2016 at 16:18.

  36. #25
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Thanks Jarek. I think my test data set only had 40 symbols, but I have one with 9 symbols and it isn't tremendously different in speed (a little faster). Yes I could pack pairs of those into a byte and then measure the encoding speed of that, but it's cheating and not comparing like for like. I don't understand where you got the 1600/5200 figures from.

    I only glanced over the paper but couldn't see any obvious statement on the hardware they ran those benchmarks on. The encoder speed is reasonable, but 182MB/s decode is pretty sluggish by modern standards.

    For what it's worth, I'm sure tANS is the way to go for high SIMD speeds anyway and it's much more symmetric in encoder/decoder timings.

    Edit: the other thing I see here is no SIMD code. They're using the same division method that Fabian (and hence I) use: multiplication by a 64-bit reciprocal and a 32-bit right shift. This directly translates to a mulhi x86_64 instruction, but I couldn't find a direct analogue in SIMD (unless using Knight's Corner). This means the encoder doesn't parallelise as efficiently as the decoder. That's where I made big gains in decode and only small gains in encode.

    Their decoder could likely be a lot faster too. It seems they're using 8-bit renormalisation (as does CRAM for that matter) and it just has a while loop in the renormalisation (it's better to have two if statements so the branch prediction can work). Indeed we found the first one can be a combination of CMOV & ADC assembly instructions, with the second renormalisation being a straight conditional as it's rare and hence easy to predict. However this all depends on the CPU. KNC disliked CMOV and preferred branches for both, presumably because it's a much simpler design and failing a branch prediction isn't costly.
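    A minimal sketch of that two-if renormalisation, using assumed constants in the rans_byte style rather than the actual Daala code; with a 32-bit state, 8-bit renormalisation and the usual scale bits at most two refills are needed per step, so the while loop unrolls into two conditionals:
    Code:
    #include <stdint.h>

    #define RANS_BYTE_L (1u << 23)   /* lower bound for 8-bit renormalisation */

    static inline void rans_dec_renorm_2if(uint32_t *r, const uint8_t **pptr) {
        uint32_t x = *r;
        const uint8_t *ptr = *pptr;
        if (x < RANS_BYTE_L) x = (x << 8) | *ptr++;   /* common, CMOV/ADC-friendly */
        if (x < RANS_BYTE_L) x = (x << 8) | *ptr++;   /* rare, easily predicted */
        *r = x;
        *pptr = ptr;
    }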

  37. #26
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by JamesB View Post
    Their decoder could likely be a lot faster too. It seems they're using 8-bit renormalisation (as does CRAM for that matter) and it just has a while loop in the renormalisation (it's better to have two if statements so the branch prediction can work). Indeed we found the first one can be a combination of CMOV & ADC assembly instructions, with the second renormalisation being a straight conditional as it's rare and hence easy to predict. However this all depends on the CPU. KNC disliked CMOV and preferred branches for both, presumably because it's a much simpler design and failing a branch prediction isn't costly.
    Please tell them so.

  38. #27
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
    Quote Originally Posted by JamesB View Post
    Edit: the other thing I see here is no SIMD code. They're using the same division method that Fabian (and hence myself) use: multiplication by a 64 bit reciprocal and a 32-bit shift right. This directly translates to a mulhi x86_64 instruction, but I couldn't find a direct analogue in SIMD (unless using Knight's Corner).
    SSE2 has 32*32=64 multiplication
    SSE4 added 32*32=32 multiplication
    AVX512 added 64*64=64 multiplication

    so if I understood you correctly, the thing you need has been available since SSE2 and is called PMULUDQ

  39. #28
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    747
    Thanks
    232
    Thanked 235 Times in 145 Posts
    Hi James, the throughput figures differ because these are in "Mbps" (bits, not bytes!) - as yours are in bytes or symbols, with a 4-bit alphabet you have to multiply yours by 4 to get Mbps.

    Regarding decoding speed, I think the main problem is that instead of a table lookup to determine the symbol, they are currently using a linear(!) search (not even binary):
    for (i = 0; rem >= (top_prob = cdf[i]); ++i) {cum_prob = top_prob;}
    https://aomedia.googlesource.com/aom...sp/ansreader.h

    LZNA has this nice SIMD search for the symbol: https://github.com/powzix/kraken/blob/master/lzna.cpp
    Eventually, there is alias rANS: https://fgiesen.wordpress.com/2014/0...distributions/

  40. #29
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    SSE2 has 32*32=64 multiplication
    SSE4 added 32*32=32 multiplication
    AVX512 added 64*64=64 multiplication

    so if I understood you correctly, the thing you need has been available since SSE2 and is called PMULUDQ
    Yes, I can use 64*64 multiplication, but my state is 32-bit. To get the top 32 bits of a 64-bit multiply I either need to turn my 8x32 lanes into two sets of 4x32 lanes, do the multiplications and the shifts, and turn them back; or else do multiple multiplies and shifts keeping it in 8x32 format; or perhaps just keep everything in 64-bit from the start. It's doable, but whatever I do is costly compared to the mulhi instruction. That took two 32-bit numbers, multiplied them and returned a new 32-bit number corresponding to (a*b)>>32. There *IS* an instruction to do precisely this on Knight's Corner, but not AVX2.

    Edit: Sorry, I forgot that the mul does the 32-bit to 64-bit widening for us, but we still then need to pack the 64-bit results back down to 32-bit again, so a direct parallel equivalent of mulhi_epi32 would be (mul_epu32 + shift) for the even lanes and (shift/permute + mul_epu32 + shift) for the odd lanes, followed by permutes or shuffles to merge the split vectors back together again.
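    Something like the following is one way that emulation could look on AVX2 (a sketch only, not claimed to be the fastest sequence): take the even and odd 32-bit lanes as two mul_epu32 products and blend the high halves back together:
    Code:
    #include <immintrin.h>

    /* Emulate a missing _mm256_mulhi_epu32: per-lane (a*b)>>32 across
     * eight unsigned 32-bit lanes. */
    static inline __m256i mulhi_epu32_avx2(__m256i a, __m256i b) {
        /* 64-bit products of the even lanes 0,2,4,6 */
        __m256i even = _mm256_mul_epu32(a, b);
        /* shift the odd lanes 1,3,5,7 down, then take their products */
        __m256i odd  = _mm256_mul_epu32(_mm256_srli_epi64(a, 32),
                                        _mm256_srli_epi64(b, 32));
        /* move the even products' high halves into the even 32-bit slots */
        even = _mm256_srli_epi64(even, 32);
        /* the odd products' high halves already sit in the odd 32-bit slots */
        return _mm256_blend_epi32(even, odd, 0xAA);
    }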
    Last edited by JamesB; 17th October 2016 at 09:52.

  41. #30
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    Quote Originally Posted by Jarek View Post
    Regarding decoding speed, I think the main problem is that instead of table to determine symbol, they are currently using linear(!) search (not even binary):
    for (i = 0; rem >= (top_prob = cdf[i]); ++i) {cum_prob = top_prob;}
    https://aomedia.googlesource.com/aom...sp/ansreader.h

    LZNA has this nice SIMD search for the symbol: https://github.com/powzix/kraken/blob/master/lzna.cpp
    Eventually, there is alias rANS: https://fgiesen.wordpress.com/2014/0...distributions/
    Ouch, a linear search would hurt big time indeed. That explains how they ended up with the encoder being faster than the decoder.

    Fabian/Charles' SIMD searching and updating for true adaptive rANS is very clever indeed. It makes for a fast adaptive codec that can give good compression. However it doesn't make for a fast static codec with stationary probabilities. To do that, the lookup table is the easiest approach, but as several people have already pointed out - if you're using lookup tables in a rANS implementation anyway, why not use tANS? tANS SIMD should be doable, although it'd require some lookups to gather the variable number of bits per SIMD lane together. tANS also doesn't have the issue alias rANS has of shifting complexity from the decoder to the encoder. Although alias rANS may well work very well for the order-1 model.

    (The answer to why I started working with rANS, btw, was simply that I didn't understand either method when I started down this road, and so plumped for the code that looked easiest to understand at a cursory glance!)
