No branches, no loops, and some interesting new method to extract the information, like the CRC trick in the first post.
If it could somehow be faster than the movemask/tzcnt combination, that would be best, but it's unlikely.
But I'm mostly interested in finding rare instructions that could be applied here.
Intel SIMD is very asymmetric, so there are many unusual instructions.
> and portability requirements of the function you are writing
AVX2 is in the thread title. But AVX512 would also be okay, if it's interesting.
The second condition (without SSE) is easier to understand: a binary search to find the bucket given the ranges. AVX would definitely be faster in the case N >= 16.
Yes, that's what I'm talking about.
But clz/tzcnt is a relatively slow scalar operation (3-cycle latency on Intel), so I keep thinking that there might be a better way, like that minpos instruction.
Btw, for two-part scanning it may be better to use popcnt instructions.
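For reference, the movemask/tzcnt combination discussed above can be sketched as follows (a minimal sketch using two SSE2 halves in place of one AVX2 compare so it compiles with baseline x86-64 flags; the table layout and function name are my own assumptions, not from this thread):

```c
#include <emmintrin.h>  /* SSE2, baseline on x86-64 */

/* Find the first bucket whose 16-bit cumulative frequency exceeds `code`.
   Assumes values stay below 0x8000 (cmpgt is a signed compare) and that
   at least one bucket satisfies the condition (cum[15] > code). */
static int find_bucket(const short cum[16], short code)
{
    __m128i c  = _mm_set1_epi16(code);
    __m128i g0 = _mm_cmpgt_epi16(_mm_loadu_si128((const __m128i *)cum), c);
    __m128i g1 = _mm_cmpgt_epi16(_mm_loadu_si128((const __m128i *)(cum + 8)), c);
    unsigned mask = (unsigned)_mm_movemask_epi8(g0)
                  | ((unsigned)_mm_movemask_epi8(g1) << 16);
    /* Each 16-bit lane contributes 2 mask bits, so the raw bit index is
       symbol*2 - the value that can directly index a 16-bit counter table. */
    return __builtin_ctz(mask) >> 1;  /* compiles to tzcnt with BMI enabled */
}
```

With AVX2 the two loads and compares collapse into a single `_mm256_cmpgt_epi16` plus `_mm256_movemask_epi8`.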
Is it an arithmetic coder for 16 symbols? (or prefix/Rice/ANS/something else...)
Note that here you don't really need "|(idx_mask<<16)" - even with BSR it's simpler to initialize the index register with the right value,
and tzcnt/popcnt don't have that problem at all.
Also there's no need for packs() - it's actually better if tzcnt returns a *2 value, since you can directly index a 16-bit counter table with it.
"LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding. 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases."
This is still very confusing to me. I always assumed all ANS variants work with more than one bit at a time. I think it is better to write out the requirements of the work in clear text instead of assuming that everyone will automatically be on the same page with you.
> LZNA gets its speed from two primary changes. 1. It uses RANS instead of arithmetic coding.
There's more hype than proof for this.
In theory, rANS consists of the same arithmetic operations as RC, just in a different order.
Currently ANS has more optimized SIMD implementations of interleaved coding,
which are more efficient on modern out-of-order CPUs than old RCs.
But there's no actual proof of ANS superiority.
Same for tANS - its main benefit is not being covered by old AC patents,
but state-machine AC implementations have existed forever, and at the FSM level
the theory behind it doesn't really matter - the CPU operations are the same.
> 2. It uses nibble-wise coding instead of bit-wise coding, so it can do 4x fewer coding operations in some cases.
Yes, but I'm actually using it with RC. I never said anything about ANS in the original post.
> I always considered all ANS to work with more than one bit at a time.
No, the main difference is that ANS is LIFO, while RC is FIFO;
the arithmetic is similar (especially in the binary case).
Encode a symbol interval [symLow..symLow+symFreq) of [0..symTotal):
rANS: state' = (state/symFreq)*symTotal+symLow+(state%symFreq);
RC: low' = low+(range/=symTotal)*symLow; range*=symFreq;
This can be illustrated with versions of decimal-to-binary conversion:
s="12345";
1) while( c=*s++ ) value = value*10+c-'0'; // rANS way, requires LIFO decoding
2) r=100000; while( c=*s++ ) value += (r/=10)*(c-'0'); // RC way, FIFO decoding is ok
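The two one-liners above can be made runnable like this (a minimal sketch; the function names are mine):

```c
#include <assert.h>

/* (1) Horner / rANS-style: the accumulated value is multiplied up at every
       step, so decoding has to peel digits off in reverse order (LIFO). */
static unsigned horner(const char *s)
{
    unsigned value = 0;
    for (char c; (c = *s++) != 0; )
        value = value * 10 + (unsigned)(c - '0');
    return value;
}

/* (2) Positional / RC-style: each digit is scaled by its own weight,
       so digits can be consumed in the order written (FIFO). */
static unsigned positional(const char *s)
{
    unsigned value = 0, r = 100000;  /* 10^5 for a 5-digit string */
    for (char c; (c = *s++) != 0; )
        value += (r /= 10) * (unsigned)(c - '0');
    return value;
}
```

Both produce the same number from "12345"; only the order in which information enters the state differs.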
That is, ANS is a complete equivalent of AC;
coding non-binary alphabets is just a speed optimization,
one that is also applicable to AC.
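To see that the rANS update formula quoted earlier is exactly invertible, here is a toy unbounded-state roundtrip (the 2-symbol alphabet and its frequencies are my own choice for illustration; a real coder would add renormalization):

```c
#include <assert.h>

/* A: interval [0..3), freq 3; B: interval [3..4), freq 1; total 4. */
enum { TOT = 4 };
static const unsigned long long LOW[2]  = { 0, 3 };
static const unsigned long long FREQ[2] = { 3, 1 };

/* state' = (state/symFreq)*symTotal + symLow + state%symFreq */
static unsigned long long enc(unsigned long long st, int sym)
{
    return (st / FREQ[sym]) * TOT + LOW[sym] + st % FREQ[sym];
}

/* Inverse: the slot state%symTotal identifies the symbol interval, then
   state = symFreq*(state/symTotal) + slot - symLow recovers the old state. */
static unsigned long long dec(unsigned long long st, int *sym)
{
    unsigned long long slot = st % TOT;
    int s = slot >= LOW[1];  /* which interval the slot falls into */
    *sym = s;
    return FREQ[s] * (st / TOT) + slot - LOW[s];
}
```

Encoding A,A,B starting from state 7 and then decoding yields B,A,A and ends back at state 7 - the LIFO order described above.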
> I think it is better to write out the requirements of the work in clear text
> instead of thinking that everyone will be automatically on the same page with you.
This thread was supposed to be about a specific AVX2 implementation of a specific algorithm
(finding a symbol interval by code value), which has been mentioned on this forum many times,
so getting complaints is pretty strange.
But I can explain in detail if you have any questions.
> Yes, but I'm using it with RC actually. I never said anything about ANS in original post.
When a topic is complicated or advances the state of the art, it makes for better discussion threads if you give an introduction, then explain the requirements or general ideas of your improvement over the state of the art, your general planned approach, what to avoid, etc. - and then add the code example, too.
Most readers won't read your stand-alone code or try to reverse-engineer the requirements from it.
I agree in general, and usually post more introductory text, but this specific topic is in the style of https://graphics.stanford.edu/~seander/bithacks.html
and requires experience with SIMD programming using intrinsics.
Explaining it in plain English would be both time-consuming and boring for _all_ readers, so I'd get even less feedback.
Also I already tried explaining it to you here: https://encode.su/threads/3542?p=67987&pp=1
But I can try different wording.
1. SIMD can be used to improve the speed of algorithm implementation.
2. There's a class of so-called "horizontal" SIMD operations: https://rust-lang.github.io/packed_s...t-hor-ops.html
3. There's a significant asymmetry in the design of x86 SIMD - there are many instructions like https://www.felixcloutier.com/x86/phminposuw
that are only defined for some specific vector size and/or element types, and require additional arrangements before they can be used.
4. In this thread I'm asking whether I missed some ways of implementing a horizontal scan, maybe based on some other obscure intrinsics.
For example, AVX512 has "reduce" intrinsics, or it may be possible to use something like https://www.felixcloutier.com/x86/pcmpistrm
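As one concrete shape a phminposuw-based scan could take (a hedged sketch of my own construction, not from this thread: it only handles 8 x 16-bit elements, assumes code < 0xFFFF so code+1 doesn't wrap, and assumes at least one counter exceeds code):

```c
#include <smmintrin.h>  /* SSE4.1 for _mm_minpos_epu16 */

/* Find the first of 8 sorted 16-bit cumulative counters that exceeds `code`.
   satsub(code+1, cum[i]) is 0 exactly when cum[i] > code, and phminposuw
   reports the LOWEST index holding the minimum - i.e. the first such zero -
   so no movemask/tzcnt step is needed at all. */
__attribute__((target("sse4.1")))
static int find_bucket8(const unsigned short cum[8], unsigned short code)
{
    __m128i v = _mm_loadu_si128((const __m128i *)cum);
    __m128i d = _mm_subs_epu16(_mm_set1_epi16((short)(code + 1)), v);
    return _mm_extract_epi16(_mm_minpos_epu16(d), 1);  /* index is in lane 1 */
}
```

Whether this beats compare+movemask+tzcnt in practice is exactly the kind of question this thread is about; phminposuw is a single 128-bit-only instruction, but it replaces three.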