| • Science | • People | • Locations | • Timeline |
| Contents | ||
a processor that determines whether a branch in the
instruction flow of a program is likely to be taken or not. Thisis called branch prediction. Branch predictors are crucial in today's modern, superscalar processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved.
All pipelined processors do branch prediction of some form, because they must guess the address of the next instruction to fetch before the current instruction has been executed. Earlier microprogrammed CPUs did not have to do branch prediction.
The early implementations of SPARC and MIPS (two of the first commercial RISC architectures) did trivial branch prediction: they always predicted that a branch would not be taken, so they always fetched the next sequential instruction. Only when a branch or jump was evaluated did the instruction fetch pointer get set to a nonsequential address.
Both CPUs evaluated branches in the decode stage and had a single cycle instruction fetch. As a result, the branch target recurrence was two cycles long, and the machine would always fetch the instruction immediately after any taken branch. Both architectures defined branch delay slots in order to utilize these fetched instructions.
Some superscalar processors (MIPS R8000, Alpha EV6 and EV8) fetched with each line of instructions a pointer to the next line. This next line predictor is not directly comparable to the other predictors listed here, because it handles both the branch target prediction as well as the branch direction prediction.
When a next line predictor points to aligned groups of 2, 4 or 8 instructions, the branch target will usually not be the first instruction fetched, and so the initial instructions fetched are wasted. Assuming for simplicity a uniform distribution of branch targets, 0.5, 1.5, and 3.5 instructions fetched are discarded, respectively.
Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch (or its delay slot) will be discarded. Once again assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.
The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
A bimodal branch predictor has a table of two-bit saturating counters, indexed with the least significant bits of the instruction addresses. Unlike the instruction cache, bimodal predictor entries typically do not have tags, and so a particular counter may be mapped to different branch instructions (this is called branch interference or branch aliasing), in which case it is likely to be less accurate. Each counter has one of four states:
When a branch is evaluated, the corresponding counter is updated. Branches evaluated as not taken decrement the state towards strongly not taken, and branches evaluated as taken increment the state towards strongly taken. The primary benefit of this two bit saturating counter scheme is that loop closing branches are always predicted taken. A one-bit scheme (like the R8000), mispredicts both the first and last branch of a loop. A two-bit scheme mispredicts just the last branch. Similarly, on heavily biased branches which almost always go one way, a one-bit scheme mispredicts twice for each odd branch, and a two-bit scheme mispredicts once.
On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to a unique counter.
Because the bimodal counter table is indexed with the instruction address bits, a superscalar processor can split the table into separate SRAMs for each instruction fetched, and fetch a prediction for every instruction in parallel with fetching the instruction, so that the branch prediction is available as soon as the branch is decoded.