# Thread: CPU Branch Prediction as an entropy model

1. ## CPU Branch Prediction as an entropy model

Originally Posted by wikipedia
"The AMD Ryzen processor, previewed on December, 13th, 2016, revealed its newest processor architecture using a neural network based branch predictor to minimize prediction errors."
So here's a "revolutionary" entropy coder based on branch prediction: http://nishi.dreamhosters.com/u/BPCM_v0.rar
Code:
```
sample_0: 256 to 79
sample_1: 256 to 77
sample_2: 232 to 227
```
sample_0 is 256 x '\x00'
sample_1 is 256 x '1'
sample_2 is c.bat

The method is based on timing (with rdtsc) a function like this:
Code:
```
uint estimate( byte* p, uint l, uint cc ) {
  uint i, f0, f1;
  for( i=0,f0=0,f1=0; i<l; i++ ) {
    if( p[i]!=cc ) ++f1; // else ++f0;
  }
  return (f0<<16)|f1;
}
```
This function processes the window buffer with cc=0..255 and takes branches based on symbol matches
(it actually jumps when p[i]==cc - compiler puts a JZ instruction there).
Branch prediction logic is supposed to "learn" the patterns, so for more probable symbols it should run faster
(faster timings turn into higher probabilities).
It kind of even works when the input is all the same symbol, but not so well with real files (still, it can compress them).

My results are for an i7 930, so it might be interesting to see what happens on a newer CPU.
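The timing harness can be sketched roughly like this (my reconstruction from the description above, not the actual BPCM source; the `min_timing` helper and the portable fallback are assumptions):

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>                   /* __rdtsc */
#define TICKS() __rdtsc()
#else
#include <time.h>
#define TICKS() ((uint64_t)clock())      /* non-x86 fallback, much coarser */
#endif

typedef unsigned char byte;
typedef unsigned int uint;

/* the timed inner loop: one conditional branch per window byte */
uint scan(const byte *p, uint l, uint cc) {
    uint i, f1 = 0;
    for (i = 0; i < l; i++)
        if (p[i] != cc) ++f1;            /* the branch the predictor "learns" */
    return f1;
}

/* time scan() for one symbol value: several passes, keep the minimum,
   as the post describes (8 passes over a 64-byte window) */
uint64_t min_timing(const byte *win, uint l, uint cc, int passes) {
    uint64_t tmin = (uint64_t)-1;
    for (int k = 0; k < passes; k++) {
        uint64_t t0 = TICKS();
        volatile uint sink = scan(win, l, cc);   /* keep the call alive */
        uint64_t t1 = TICKS();
        (void)sink;
        if (t1 - t0 < tmin) tmin = t1 - t0;
    }
    return tmin;
}
```

Calling `min_timing(win, 64, cc, 8)` for cc=0..255 then gives the per-symbol timings that get mapped to probabilities.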

2. ## Thanks (5):

encode (16th December 2016),Mike (17th December 2016),quadro (17th December 2016),RamiroCruzo (17th December 2016),xinix (18th December 2016)

3. AMD has been using Neural Net branch predictors for years, e.g. in the Bobcat/Jaguar APU cores (http://amd-dev.wpengine.netdna-cdn.c.../10/apu101.pdf page 34). I believe they already used them in some older Opterons as well, but I don't have a reference to hand.

What's changed between 2011 and now is that Neural Nets are very "in" right now, so they get name-dropped, whereas they weren't that fashionable a few years ago.

4. Yeah, from what I've read, the branch predictors in Intel CPUs have also basically been a CM since Intel Core or so -
recording the history of taken/not-taken branches, detecting patterns in that history (loops etc.) and estimating the probability.
So I thought that maybe they finally added context mixing in Zen (that would fit the neural network concept,
because the logistic activation function in a NN is the same as the paq mixer's).

Originally Posted by wikipedia
The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth. Classical predictors require exponential resource growth. Jimenez reports a global improvement of 5.7% over a McFarling-style hybrid predictor.
In any case, this coder should compress better with an improved branch predictor.
I'd appreciate any ideas on improving the estimation function though - at the moment it doesn't seem
like the branch predictor can estimate the probability of a branch being taken even roughly.

5. ## Thanks:

xinix (18th December 2016)

6. For example, these are results for files of the form 0...010...01, i.e. substr(('0' x $n . '1') x 10000, 0, 10000)
Code:
```
 1: 01010101010101010101010101010101010101010...
 2: 00100100100100100100100100100100100100100...
 3: 00010001000100010001000100010001000100010...
 4: 00001000010000100001000010000100001000010...
 5: 00000100000100000100000100000100000100000...
 6: 00000010000001000000100000010000001000000...
 7: 00000001000000010000000100000001000000010...
 8: 00000000100000000100000000100000000100000...
 9: 00000000010000000001000000000100000000010...
```
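These test files can be generated with something like the following (`make_pattern` is a hypothetical helper equivalent to the Perl one-liner above):

```c
/* equivalent of the Perl substr(('0' x $n . '1') x 10000, 0, 10000):
   a '1' every n+1 positions, '0' elsewhere */
void make_pattern(char *out, int len, int n) {
    for (int i = 0; i < len; i++)
        out[i] = (i % (n + 1) == n) ? '1' : '0';
    out[len] = '\0';
}
```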

This looks like an actual branch history of length 4-5, so loops of length 6..15 are not handled efficiently,
and then after 16 a loop predictor starts working (based on distances between jumps).

7. ## Thanks (2):

RamiroCruzo (19th December 2016),xinix (18th December 2016)

8. If you want to predict bits with the branch predictor, you should make your branches correspond more directly with the bits you want to predict.

Most branch predictors use a global history of the last N branches for context, along with the address of the branch instruction. In the "estimate" function you gave, fully half the branches are just for the outer loop over the array (meaning those bits of the history contain essentially zero pertinent information), and the rest are just the indicator function for the question "is x[i] == cc or not?", which a) is likely to be a very rarely taken branch (below the probability resolution of typical predictors), and b) virtually ensures that the (very relevant) information of "if it wasn't cc, what was it?" is not to be found anywhere, even in very long prediction histories.
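A sketch of the more direct encoding suggested here - one branch per data *bit*, so the branch outcomes the predictor sees are the bit stream itself rather than an indicator function per byte value (`run_bits` is an illustrative name, not from any real predictor harness):

```c
/* feed every bit of the data to the predictor as one branch outcome */
unsigned long run_bits(const unsigned char *p, unsigned len) {
    unsigned long taken = 0;
    for (unsigned i = 0; i < len; i++)
        for (int b = 7; b >= 0; b--)
            if ((p[i] >> b) & 1)   /* branch outcome == the data bit */
                taken++;
    return taken;
}
```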

Separately, trying to divine branch predictor behavior from RDTSCs isn't exactly a convenient testing setup. Just use an actual branch predictor! There are regular competitions and the entries come with (C++) source code.
http://www.jilp.org/cbp2016/program.html
http://www.jilp.org/cbp2014/program.html
http://www.jilp.org/jwac-2/program/JWAC-2-program.htm
http://hpca23.cse.tamu.edu/taco/camino/cbp2/source.html (papers here http://hpca23.cse.tamu.edu/taco/camino/cbp2/CBP-2.pdf)
http://www.jilp.org/cbp/Agenda-and-Results.htm

9. ## Thanks (3):

Bulat Ziganshin (17th December 2016),Cyan (17th December 2016),Mike (17th December 2016)

10. > Most branch predictors use a global history of the last N branches for context,
> along with the address of the branch instruction.

I'm running the estimation function multiple times over the window, separately
for each byte value. In theory, if the BP is able to predict patterns in a symbol's occurrences,
the tsc timing for that symbol would be smaller, and thus the assigned probability
would be higher. Currently the model uses a 64-byte window and does 8 passes for each symbol value -
imho that should be enough to flush the caches.

> In the "estimate" function you gave, fully half the branches are just for the outer loop over the array

Yes, but these are the same in all iterations, and min tsc is calculated and subtracted from all timings.

I guess I could try eliminating the outer loop branch by generating
unrolled code with symbol branches aligned in a way to cause branch cache
collisions, but that would depend on specific branch cache design too much.

> and the rest are just the indicator function for the question "is x[i] == cc or not?", which is
> a) likely to be be a very rarely taken branch (below the probability resolution of typical predictors),

If the CPU branch predictor really estimated the branch-taking probability with any sane precision
(and I mean more than 2 bits), that could be enough for an order-0 symbol probability estimation.

Again,
1) estimate() is timed for all byte values 0..255
2) compression actually _works_ - the coder doesn't blow up compressible files.

> b) virtually ensures that the (very relevant) information of "if it wasn't cc, what was it?"
> is not to be found anywhere even in very long prediction histories.

Again, estimate is called for every byte value separately, so I'm basically working with 256 binary strings of (c==cc).
Still, compression works, and although the results are not up to par, they are stable enough to
see the correlation with the file's contents.

> Separately, trying to divine branch predictor behavior from RDTSCs isn't exactly a convenient testing setup.

That depends on the goal imho.
As it is, I think this coder could be useful to compare branch predictors on significantly different x86 cpus.

And originally it was just a joke - "if there's a hardware context model in the cpu, why not use it for compression?" -
but the idea here wasn't to build a better branch predictor or anything -
imho there are too many restrictions for it to be interesting.

11. ## Thanks (2):

RamiroCruzo (17th December 2016),xinix (18th December 2016)

12. Originally Posted by Shelwien
I'm running the estimation function multiple times over the window, separately
for each byte value. In theory, if BP is able to predict patterns in the symbol's occurences,
Yeah, but that's just the thing, the way you end up encoding the data into branch decisions wipes out most of the information that a branch predictor actually does modelling on.

The number one thing most branch predictors you're actually going to see in HW try to pick up on is correlation between the branch history and the to-be-predicted branch. The way your scheme "encodes" bytes into branch decisions destroys most of the useful information in there, because you're essentially forcing the branch predictor to predict the positions of byte value cc without having any actual context. It *will* be able to pick up patterns when, say, a given byte value occurs every 4 positions, but not the more interesting case of characters that tend to occur grouped together etc.

Suppose you have a lexical scanner that contains the check "if (cur_char == '\r' || cur_char == '\n')". When processing Windows-style text files, a typical branch predictor will pick up on the fact that the presence of a LF character is highly correlated with the presence of a CR character in the previous loop iteration. But the line lengths themselves are fairly random, and if you were to first predict the positions of all CRs in a block multiple times, and then predict the positions of all LFs in a block multiple times, it wouldn't work very well. I'm not saying it wouldn't work at all - it obviously does - just that it's not a useful way to gauge the capabilities of the hardware.

Originally Posted by Shelwien
> In the "estimate" function you gave, fully half the branches are just for the outer loop over the array

Yes, but these are the same in all iterations, and min tsc is calculated and subtracted from all timings.
Sure, they're always the same, but having half of the (finite-size!) branch history filled with things that have nothing to do with the data again decreases efficiency further.

Originally Posted by Shelwien
If cpu branch predictor really estimated the branch-taking probability with any sane precision
(and I mean not 2 bits), there could be enough for a order0 symbol's probability estimation.
Why would it? Not a rhetorical question: what's the use of a high-resolution probability estimate here?

A branch predictor isn't generating probability estimates for an arithmetic coder. Its end result is a single bit, "taken" or "not taken", which feeds into a bunch of machinery that ultimately decides which address to continue fetching instructions at.

For an arithmetic coding back-end, the difference between a coin toss with a "heads" probability of P(H)=51%, and one with P(H)=99%, is substantial. For a branch predictor, it's not. It always wants to predict H in both cases (and eats a lot less mispredictions in the second case).

This is important to be clear about. Say you have a "coin toss" (effectively, random order-0) source that's 55% H (heads), 45% T (tails). In that sequence, the goal of the ideal branch predictor is to predict H (the most likely outcome), 100% of the time; if its prediction is a sequence that has 55% H and 45% T, it's doing a really bad job. (Since that sequence is virtually guaranteed to be completely uncorrelated with the actual sequence of events).

Branch predictors have relatively large histories (often >100 branches long, albeit subsampled), and are trying to find context models that produce near-certain predictions. If a given context doesn't have a very good track record at predicting, it's not worth keeping around; that space would be better used on something else.

There are also secondary predictors that just purely keep track of the overall miss rate of a given predictor; this is sometimes called "statistical correction", and relevant to the coin-toss example I mentioned. Basically, whenever you have a branch where the context-model part of the predictor appears to be generating a lot of misses, then it's likely trying to mine structure from noise. So for those branches, you keep track simply of whether they're biased one way or the other, mostly ignore the context, and always predict the most likely (in the past) outcome.
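The "always predict the most likely past outcome" behaviour described here is what a classic 2-bit saturating counter gives you - a minimal textbook sketch, not any specific CPU's design:

```c
/* 2-bit saturating counter: states 0-1 predict "not taken", 2-3 "taken".
   A single mispredict does not flip the prediction - the bias survives. */
typedef struct { int state; } counter2;   /* state in 0..3 */

int predict(const counter2 *c) { return c->state >= 2; }

void update(counter2 *c, int taken) {
    if (taken) { if (c->state < 3) c->state++; }
    else       { if (c->state > 0) c->state--; }
}
```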

Originally Posted by Shelwien
That depends on the goal imho.
As it is, I think this coder could be useful to compare branch predictors on significantly different x86 cpus.
I agree that yes, this is a thing you can measure, but very much disagree that it's a useful thing to measure, or that it tells you anything relevant about the branch predictor (or how well it does context modelling).

A sensible thing to compare branch predictors on is on branch traces of actual programs (which is what CBP does).
Running your program will indeed tell you something about the branch predictor in a CPU, and I'm willing to believe you that it's reasonably repeatable, but I flat-out disagree about it being a useful way to compare branch predictors.

If you want to know how good a branch predictor is at predicting a bit stream, then I believe that the right way to do that is by feeding in the bit stream as branch history and seeing what it produces. Turning it into a much larger set of indicator functions, thus destroying most of the structure that the built-in context model actually tries to key off of, is still an experiment you can run, but it's not an experiment that tells you much about anything.

13. i7-4790
Code:
```
sample_0: 256 to 72
sample_1: 256 to 72
sample_2: 232 to 218
```

14. ## Thanks:

Shelwien (17th December 2016)

15. > The way your scheme "encodes" bytes into branch decisions destroys most of
> the useful information in there, because you're essentially forcing the
> branch predictor to predict the positions of byte value cc, without having
> any actual context.

Yes, but it's not totally wrong even completely without context - it could work
as an order-0 model at least.
Also, sure, maybe half of the bits of the global branch history are taken
by the loop branch, but then half still remains?
Also, I'm pretty sure that there's now a local history for each specific branch too.

> It *will* be able to pick up patterns when say a given byte value occurs
> every 4 positions, but not the more interesting case of characters that
> tend to occur grouped together etc.

Somehow it's actually pretty bad in the "every 4 positions" case, but gets better
at "every 16+ positions" - see the image http://encode.su/threads/?p=51181&pp=1

> I'm not saying it wouldn't work at all - it obviously does - just that it's
> not a useful way to gauge the capabilities of the hardware.

Yes, it's unlikely to be better than a real order-0 coder - and right now it's still much worse.

Atm I have 4 ideas about improving it:

1. Generating the code for estimate(), which would have a single branch per page.

2. Converting bytewise data to 255 bitstreams first.

3. Making the mismatch branch more expensive, to amplify the timing difference:

Code:
```
volatile uint f0,f1;
for( i=0,f0=0x12345678,f1=0x12345678; i<l; i++ ) {
  if( p[i]!=cc ) f0=f1+f0/f1,f0=f1+f0/f1,f0=f1+f0/f1,f0=f1+f0/f1,++f1; // else ++f1;
}
```
4. Using some better method to convert timings to probabilities,
likely based on the logistic function; currently it's just
Code:
`    for( i=1; i<=256; i++ ) tim[i] = qword(t_max+1-t_min)*64/(tim[i]+1-t_min)+1; // (tim[i]-t_min)/(winsize/32)+1;`
> A branch predictor isn't generating probability estimates for an arithmetic
> coder. Its end result is a single bit, "taken" or "not taken", which feeds
> into a bunch of machinery that ultimately decides which address to continue
> fetching instructions at.

Even for a single bit, it's likely generating at least 2 bits internally - see https://en.wikipedia.org/wiki/Branch...rating_counter

Also, if there was no prediction at all, just more clocks to take a branch,
it would already be the equivalent of an occurrence counter.

And again, look at the graph - compression smoothly improves with less

> So for those branches, you keep track simply of whether they're biased one
> way or the other, mostly ignore the context, and always predict the most
> likely (in the past) outcome.

Uh, not quite.
1. Remember, it takes into account the clocks of _all_ branches on a given symbol.
So the timings would also be smaller depending on the prediction quality of previous
symbols in the window. As I said, ideally it could perform like order-0, or a little better.

2. It's not a given that there are only 2 decisions on every branch.
There can also be "speculative execution", where both cases are processed at once
and one is then discarded - it could be enabled based on the (un)certainty of the BP prediction.

> I agree that yes, this is a thing you can measure, but very much disagree
> that it's a useful thing to measure, or that it tells you anything relevant
> about the branch predictor (or how well it does context modelling).

And I agree that this coder can be improved, but disagree that it's useless atm -
let's just wait for the results of the test with loop files from @encode.

Btw, files are here: http://nishi.dreamhosters.com/u/bpc_testfiles.rar
Run
Code:
`for %a in (*.txt) do BPC.exe c %a %~na.bpc`
Then
Code:
`for %a in (*.bpc) do echo %~za >>results.txt`
to store results.

16. ## Thanks:

xinix (18th December 2016)

17. Got results for the i7-3820 ("Sandy Bridge-E") first, thanks to kampaster. They seem the same as encode's (i7-4790, "Haswell-DT") though.

Compare with the i7-930 ("Bloomfield").

Seems like a longer context (12 bits vs 5?) and much better results overall.

Also it seems like both have special handling for the "0101010101..." case (x=1 on the graphs).

18. ## Thanks:

xinix (18th December 2016)

19. i7-4790
Code:
```
results.txt:
1814
4206
3790
3232
2960
2738
2407
2330
2247
2278
2141
2109
2053
2090
2140
1995
2092
1913
2322
2239
2530
2120
2305
2332
2221
2045
2005
1999
1998
1920
1939
196608
```

20. Thanks, this seems even better - I guess the history is much longer.

21. ## Thanks (2):

encode (17th December 2016),xinix (18th December 2016)
