Instruction Fetch
The processor starts processing code by fetching instructions from the L1I instruction cache and decoding them. x86 instructions have variable length, which makes it harder to determine their boundaries before decoding starts. To make sure that determining instruction lengths does not affect the decoding speed, K8/K10 processors partially decode (predecode) instructions while the lines are being loaded into the L1I cache. The boundary information is stored in special fields of the L1I cache (3 bits of predecode data per instruction byte). Because the predecoding is performed during the load into the cache, the instruction boundaries are determined outside the decode pipes, which allows a steady decoding rate to be maintained regardless of instruction format and length.
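To illustrate the idea, here is a rough sketch in C of how such boundary marking could look. It is only an illustration of the principle, not a description of AMD's actual predecode hardware: the line size, the marker layout and the instr_len helper are assumptions made for the example.

```c
/* Sketch only: mark instruction boundaries once, while a line is filled into
 * L1I, so the decode pipes never have to re-discover instruction lengths. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LINE_BYTES 64

struct l1i_line {
    uint8_t bytes[LINE_BYTES];   /* raw instruction bytes                       */
    uint8_t marks[LINE_BYTES];   /* stands in for the 3 predecode bits per byte */
};

/* Toy stand-in for a real x86 length decoder, which would have to parse
 * prefixes, opcode, ModRM, SIB and displacement; here every "instruction"
 * is simply 4 bytes long. */
static size_t instr_len(const uint8_t *p) { (void)p; return 4; }

/* Runs once at line fill: flag the last byte of every instruction. */
static void predecode_line(struct l1i_line *line)
{
    size_t i = 0;
    while (i < LINE_BYTES) {
        size_t len = instr_len(&line->bytes[i]);
        if (i + len > LINE_BYTES)
            break;                          /* instruction spills into the next line */
        line->marks[i + len - 1] |= 1u;     /* end-of-instruction marker             */
        i += len;
    }
}

int main(void)
{
    struct l1i_line line;
    memset(&line, 0, sizeof line);
    predecode_line(&line);       /* later fetches can use line.marks directly */
    return 0;
}
```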
Processors load blocks of instructions from the cache and then pick out the instructions that need to be sent for decoding. A CPU based on the K10 micro-architecture fetches instructions from the L1I cache in aligned 32-byte blocks, while K8 and Core 2 processors fetch instructions in 16-byte blocks. At 16 bytes per clock the instructions are fetched fast enough for K8 and Core 2 processors to send three instructions with an average length of 5 bytes for decoding every clock cycle. However, some x86 instructions can be up to 15 bytes long, and in some algorithms the length of several adjacent instructions may exceed 5 bytes. In such cases it is impossible to decode three instructions per clock (Pic. 1).
Pic 1: A few adjacent long instructions limit the decoding speed when instructions are fetched in 16-byte blocks.
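The arithmetic behind this limitation is easy to model. The toy simulation below is our own simplification rather than a cycle-accurate model of K8 or K10: each clock the front end delivers a fixed number of bytes and a 3-slot decoder consumes instructions of a fixed length. With 7-byte instructions a 16-byte fetch window sustains only about 2.3 instructions per clock, while a 32-byte window keeps the decoder fully fed.

```c
/* Toy model: per-cycle fetch of `window` bytes feeding a 3-wide decoder. */
#include <stdio.h>

#define DECODE_WIDTH 3

static double sustained_rate(int window, int instr_len, int cycles)
{
    long fetched_bytes = 0, decoded = 0;
    for (int c = 0; c < cycles; c++) {
        fetched_bytes += window;                         /* bytes delivered so far */
        for (int slot = 0; slot < DECODE_WIDTH; slot++)  /* up to 3 decode slots   */
            if ((decoded + 1) * (long)instr_len <= fetched_bytes)
                decoded++;                               /* next instruction is complete */
    }
    return (double)decoded / cycles;
}

int main(void)
{
    printf("16-byte fetch, 7-byte instructions: %.2f instr/clock\n",
           sustained_rate(16, 7, 1000));                 /* ~2.3 */
    printf("32-byte fetch, 7-byte instructions: %.2f instr/clock\n",
           sustained_rate(32, 7, 1000));                 /* 3.0  */
    return 0;
}
```

With 5-byte instructions both windows keep the decoder busy (3 x 5 = 15 bytes fit in a 16-byte block), which is exactly why a 16-byte fetch is sufficient only while the average instruction length stays at 5 bytes or below.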
For instance, a simple SSE2 instruction with register-register operands (for example, movapd xmm0, xmm1) is 4 bytes long. However, if the instruction addresses memory using a base register and an offset (for example, movapd xmm0, [eax+16]), its length grows to 6-9 bytes, depending on the offset. If additional registers are involved in 64-bit mode, one more single-byte REX prefix is added to the instruction code. This way, SSE2 instructions in 64-bit mode may become 7-10 bytes long. SSE1 instructions are 1 byte shorter if they are vector instructions (in other words, if they work on four 32-bit values), but a scalar SSE1 instruction (working on one operand) can also be 7-10 bytes long under the same conditions.
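Where these numbers come from is easy to see if you add up the standard x86 encoding fields. The following back-of-the-envelope calculation is our own breakdown of the generic prefix/opcode/ModRM/SIB/displacement/REX components rather than figures from AMD documentation, but it reproduces the worst-case lengths quoted above.

```c
/* Length of a movapd-style instruction as a sum of x86 encoding fields. */
#include <stdio.h>

int main(void)
{
    /* movapd xmm0, xmm1: 66 prefix + two-byte opcode + ModRM = 4 bytes */
    int reg_reg = 1 /* 66 prefix */ + 2 /* 0F 28 opcode */ + 1 /* ModRM */;

    /* A memory operand with base, index and a 32-bit offset adds a SIB byte
       and 4 displacement bytes. */
    int mem_disp32 = reg_reg + 1 /* SIB */ + 4 /* disp32 */;

    /* Touching the additional registers of 64-bit mode adds one REX prefix. */
    int mem_disp32_rex = mem_disp32 + 1 /* REX */;

    printf("reg-reg: %d, mem+disp32: %d, with REX: %d bytes\n",
           reg_reg, mem_disp32, mem_disp32_rex);   /* prints 4, 9 and 10 */
    return 0;
}
```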
Fetching blocks of at most 16 bytes is not a limitation for the K8 processor in this case, because it cannot decode vector instructions faster than 3 per 2 clocks anyway. For the K10 architecture, however, a 16-byte block could become a bottleneck, so increasing the maximum fetch block size to 32 bytes is a fully justified measure.
By the way, Core 2 processors fetch 16-byte instruction blocks just like K8 processors, which is why they can decode 4 instructions per clock cycle efficiently only if the average instruction length does not exceed 4 bytes. Otherwise, the decoder cannot sustain 4, or even 3, instructions per clock. However, Core 2 processors feature a special internal 64-byte buffer that stores the last four requested 16-byte blocks. Instructions are fetched from this buffer at a rate of 32 bytes per clock. This buffer allows short loops to be cached, removing the fetch-rate limitation for them and saving up to 1 clock cycle every time a branch back to the beginning of the loop is predicted. However, such loops must contain no more than 18 instructions and no more than 4 conditional branches, and they must not include any ret instructions.
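These restrictions boil down to a simple eligibility check. The sketch below is merely our own restatement of the quoted limits in C, not Intel's actual loop-detection logic; the 64-byte size limit simply follows from the size of the buffer itself.

```c
/* Can a loop be streamed from the Core 2 fetch buffer? (illustrative check) */
#include <stdbool.h>
#include <stdio.h>

struct loop_info {
    int  byte_size;             /* total size of the loop body in bytes   */
    int  instructions;          /* number of instructions in the loop     */
    int  conditional_branches;  /* conditional branches inside the loop   */
    bool contains_ret;          /* true if the loop body contains a ret   */
};

static bool fits_in_fetch_buffer(const struct loop_info *l)
{
    return l->byte_size <= 64 &&            /* four 16-byte blocks */
           l->instructions <= 18 &&
           l->conditional_branches <= 4 &&
           !l->contains_ret;
}

int main(void)
{
    struct loop_info tight_loop = { 48, 12, 2, false };
    printf("loop cached: %s\n", fits_in_fetch_buffer(&tight_loop) ? "yes" : "no");
    return 0;
}
```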