Decoding
The x86 instructions extracted from the block of bytes are decoded into macro-operations. A macro-op can comprise up to two micro-operations: an integer or floating-point arithmetic micro-op and an address micro-op for memory access. The scheduler splits macro-ops into micro-ops before sending them for execution. The decoder of K8 processors distinguishes between three types of instructions:
- DirectPath Single instructions are decoded into one macro-op in the hardware decoder
- DirectPath Double instructions are decoded into two macro-ops in the hardware decoder
- VectorPath instructions are decoded into three or more macro-ops using the on-chip microcode-engine ROM
In a K8 processor, DirectPath and VectorPath instructions cannot be dispatched simultaneously. The decoders issue their results at a rate of 3 macro-ops per cycle. Thus, in one cycle the hardware decoder can decode 3 Single instructions, 1 Double and 1 Single instruction, or 1.5 Double instructions (3 Double instructions every two cycles). Since one VectorPath instruction can be decoded into more than 3 macro-ops, decoding it can take more than 1 cycle.
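The decode rates above follow from simple arithmetic on the 3 macro-ops-per-cycle limit. The sketch below is only an illustrative model, not a description of the actual hardware: the function names are hypothetical, and it assumes that a DirectPath Double instruction may straddle a cycle boundary, which is what the "3 Double instructions every two cycles" figure implies.

```python
from math import ceil

MACRO_OPS_PER_CYCLE = 3  # decoder output width stated in the text


def directpath_cycles(macro_op_counts):
    """Cycles needed to decode a run of DirectPath instructions.

    Each entry is 1 (DirectPath Single) or 2 (DirectPath Double).
    Assumption: macro-ops fill the 3 slots per cycle back to back.
    """
    if any(n not in (1, 2) for n in macro_op_counts):
        raise ValueError("DirectPath instructions yield 1 or 2 macro-ops")
    return ceil(sum(macro_op_counts) / MACRO_OPS_PER_CYCLE)


def vectorpath_cycles(macro_ops):
    """Cycles needed to decode one VectorPath instruction (3+ macro-ops)."""
    if macro_ops < 3:
        raise ValueError("VectorPath instructions yield 3 or more macro-ops")
    return ceil(macro_ops / MACRO_OPS_PER_CYCLE)


if __name__ == "__main__":
    print(directpath_cycles([1, 1, 1]))  # three Single instructions
    print(directpath_cycles([2, 1]))     # one Double plus one Single
    print(directpath_cycles([2, 2, 2]))  # three Double instructions
    print(vectorpath_cycles(7))          # a 7-macro-op VectorPath instruction
```

Under these assumptions, three Double instructions produce 6 macro-ops and therefore occupy two cycles, which matches the 1.5 instructions per cycle quoted above.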
The macro-ops produced by the decoder in each clock cycle are combined into groups. A group may contain only 2 or even 1 macro-op, because DirectPath and VectorPath instructions alternate and because instruction fetch latencies vary. Such a group is padded with empty macro-ops to a total of three macro-ops and is then dispatched.
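The padding step can be pictured with a few lines of code. This is a minimal sketch for illustration only; the `EMPTY` marker and the `pad_group` helper are hypothetical names, and macro-ops are represented as plain strings.

```python
GROUP_SIZE = 3     # macro-ops per dispatched group, as stated in the text
EMPTY = "empty"    # hypothetical placeholder for an empty macro-op


def pad_group(macro_ops):
    """Return a full group of three, padding short groups with empty macro-ops."""
    if not 1 <= len(macro_ops) <= GROUP_SIZE:
        raise ValueError("a group holds between 1 and 3 macro-ops")
    return list(macro_ops) + [EMPTY] * (GROUP_SIZE - len(macro_ops))


if __name__ == "__main__":
    print(pad_group(["add"]))
    print(pad_group(["add", "load"]))
    print(pad_group(["add", "load", "mul"]))
```

Every empty slot is decode bandwidth that goes unused in that cycle, which is why frequent switching between DirectPath and VectorPath instructions lowers the effective decode rate.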
Vector (packed) instructions from the SSE, SSE2 and SSE3 sets are split in the K8 processor into pairs of macro-ops that separately process the upper and lower 64-bit halves of a 128-bit SSE register on 64-bit execution units. That is why such instructions are decoded in the K8 processor at a rate of 3 instructions per 2 clock cycles, the same rate as DirectPath Double instructions. The SSE execution units in the future K8L processor will be widened to 128 bits, so there will be no need to split vector instructions into two parts. The decoding algorithm will evidently be changed so that vector instructions are decoded into single 128-bit macro-ops at a rate of 3 instructions per cycle.
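To make the split concrete, the sketch below emulates a packed 64-bit integer add (the behavior of the SSE2 `PADDQ` instruction) as two independent 64-bit operations, one per register half. It illustrates the idea only; the helper names are hypothetical, and 128-bit registers are modeled as Python integers.

```python
MASK64 = (1 << 64) - 1  # selects one 64-bit half


def split128(value):
    """Split a 128-bit value into its (lower, upper) 64-bit halves."""
    return value & MASK64, (value >> 64) & MASK64


def packed_add64_as_two_ops(a, b):
    """Packed 64-bit add performed as two separate 64-bit operations.

    Each half wraps around on its own; no carry crosses from the lower
    half into the upper half, as in a packed SIMD add.
    """
    a_lo, a_hi = split128(a)
    b_lo, b_hi = split128(b)
    lo = (a_lo + b_lo) & MASK64  # first macro-op: lower 64 bits
    hi = (a_hi + b_hi) & MASK64  # second macro-op: upper 64 bits
    return (hi << 64) | lo


if __name__ == "__main__":
    a = (1 << 64) | MASK64  # upper half = 1, lower half = all ones
    b = (2 << 64) | 1       # upper half = 2, lower half = 1
    print(hex(packed_add64_as_two_ops(a, b)))
```

Because each such instruction yields two macro-ops and the decoder emits three macro-ops per cycle, three instructions produce six macro-ops and take two cycles. A 128-bit execution unit would need only one macro-op per instruction.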
Although the decoder of the K8L processor may not be able to decode 4-5 instructions per cycle, as Conroe can under favorable conditions, this will not hinder program execution, because instructions are executed at an average rate of fewer than 3 per cycle. K8 usually decodes an x86 instruction into fewer macro-operations than Conroe does. This, together with the 32-byte instruction fetch, makes its decoder highly efficient.