Integer Execution Unit
The Integer Execution Unit of the K8 and K10 processors consists of three symmetrical integer pipes. Each pipe has its own scheduler with an 8-MOP queue, an identical set of integer arithmetic-logic units (ALU) and address generation units (AGU), and a branch prediction unit. In addition, a multiplication unit is connected to pipe 0, and pipe 2 is tied to the execution unit for the new LZCNT and POPCNT operations, which we will discuss in detail later in this article.
Pic. 3: Integer Execution Unit
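To make the semantics of the two new operations concrete, here is a minimal C sketch that models what LZCNT (count leading zeros) and POPCNT (count set bits) compute; the function names are our own, and real hardware performs each in a single instruction rather than a loop:

```c
#include <stdint.h>

/* Software model of POPCNT: count the set bits in a 32-bit value. */
static unsigned popcnt32(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit each iteration */
        n++;
    }
    return n;
}

/* Software model of LZCNT: count leading zero bits of a 32-bit value.
   Unlike the older BSR instruction, LZCNT is defined for a zero input. */
static unsigned lzcnt32(uint32_t x) {
    if (x == 0)
        return 32;
    unsigned n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}
```

These bit-counting operations are common in chess engines, sparse bitmap traversal, and population-count-heavy code, which is why dedicating hardware to them on pipe 2 is worthwhile.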
The queue each MOP goes to is determined by the static position of the instruction within its triplet. Each macro-operation from the triplet is dispatched from the reorder buffer for execution in its turn. On the one hand, this simplifies instruction control; on the other hand, it may result in an ill-balanced load on the queues if a chain of dependent operations is unfavorably placed in the program code (in practice this occurs very rarely and hardly affects actual performance). The decoder places multiplications and the extended bit operations into the corresponding triplet slots so that they land in the proper pipe.
As we have already said, in the scheduler queues of the integer pipes MOPs are split into integer operations and addressed memory requests. When the data is available, the scheduler may issue one integer operation to the ALU and one address operation to the AGU from each queue, with a maximum of two simultaneous memory requests. So up to 3 integer operations and 2 memory operations (64-bit reads/writes in any combination) may be issued for execution per clock. Micro-operations from different arithmetic MOPs are issued from their queues out of order, depending on the readiness of their data. As soon as both the arithmetic and the address micro-operations of a MOP have been executed, the MOP is removed from the scheduler queue, making room for new operations.
The K8 processor selects memory-request address micro-operations in program order. Memory requests that occur later in the program code cannot be executed ahead of earlier ones. As a result, if the address of an earlier operation cannot be calculated, all subsequent address operations get blocked, even if their operands are already ready.
For example:
add ebx, ecx
mov eax, [ebx+10h]  ; quick address calculation
mov ecx, [eax+ebx]  ; address depends on the result of the previous instruction
mov edx, [ebx+24h]  ; will not be sent for execution until the addresses of all previous instructions have been calculated
This may cause performance losses and is one of the grave bottlenecks of the K8 processor. As a result, although K8 can process two read instructions per clock, on some code it executes memory requests less efficiently than the Core 2 processor, which launches only one read instruction per clock but applies speculative out-of-order execution and can move a read ahead of preceding reads and writes.
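The dependent-address pattern in the listing above corresponds to pointer chasing in C. As a hypothetical sketch (the function and array names are our own), each load's address comes from the previous load's result, so no amount of address reordering can overlap these accesses; this is exactly the chain that stalls K8's in-order address selection:

```c
/* Pointer chasing: every load address depends on the previous load's
   result, forming a serial dependency chain of memory accesses. */
static int chase(const int *next, int start, int steps) {
    int i = start;
    while (steps--)
        i = next[i];  /* address of the next load depends on this load */
    return i;
}
```

By contrast, loads whose addresses are independent (like `mov edx, [ebx+24h]` above) could in principle be issued early; K8 simply never does so, while K10 and Core 2 can.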
CPUs with the K10 micro-architecture no longer suffer from this bottleneck. K10 processors can not only process reads out of order, but can even execute a read ahead of a write if the CPU is certain that no address conflict exists between the two. By launching reads ahead of writes, the processor can significantly speed up some types of code, such as loops that begin by reading data from memory and end by writing the calculation result back to memory.
L1:
mov eax, [esi]   ; data loading
.....            ; data processing
mov [edi], eax   ; storing the result
cmp
jnz L1
In situations like this, a processor that cannot move the read ahead of the write cannot begin the next loop iteration until the result of the current one has been completely written to memory. CPUs that support read reordering can start loading the data for the next iteration without waiting for the current one to complete.
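The load-process-store loop from the listing above can be written in C as follows; the arrays and the arithmetic are illustrative placeholders. Because `src` and `dst` never alias here, a core that can reorder loads ahead of earlier stores (K10, Core 2) is free to begin loading `src[i+1]` before the store to `dst[i]` has completed:

```c
#include <stddef.h>

/* Each iteration loads from src[], processes, and stores to dst[].
   With non-aliasing addresses, successive iterations' loads can be
   overlapped with the previous iteration's store by the hardware. */
static void process(const int *src, int *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2 + 1;  /* placeholder "data processing" */
}
```

The transformation is invisible at the source level: the hardware's load/store unit, not the compiler, decides whether the next load may bypass the pending store.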
Unfortunately, the K10 processor cannot perform a load speculatively ahead of a store whose address is not yet known, as the Core 2 processor can. Although such speculation may sometimes result in a penalty, mispredictions are very rare in real program code (only about 5% of cases), which is why speculative loading is well justified from a performance perspective.
Another improvement in the K10 integer unit is an optimized integer division algorithm. The latency of an integer division now depends on the most significant bits of the dividend and the divisor: for example, if the dividend equals 0, division takes almost half the time. Integer division is, however, a very rare operation. Since it is usually quite slow, it is carefully avoided in real program code most of the time, typically replaced with multiplication by a reciprocal, with shifts, or by other means, which is why this optimization is unlikely to have any significant effect on overall application performance.
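The reciprocal-multiplication trick mentioned above is worth a concrete sketch. For division by a constant, compilers replace the slow DIV instruction with a multiply by a precomputed "magic" reciprocal and a shift; the example below uses the well-known constant for unsigned division by 10 (0xCCCCCCCD is the rounded-up value of 2^35/10), and the function name is our own:

```c
#include <stdint.h>

/* Unsigned division by the constant 10 without a DIV instruction:
   multiply by ceil(2^35 / 10) = 0xCCCCCCCD, then shift right by 35.
   Correct for every 32-bit unsigned dividend. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

A 64-bit multiply plus a shift costs a few cycles, versus tens of cycles for a hardware divide, which is exactly why division by a variable remains rare in optimized code.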
All in all, the K10 integer unit should be quite efficient. With out-of-order memory request processing added, there are no evident bottlenecks left in it. Although K10's queues are not as deep as those of the Core 2 processors, it is free from the register-file read limitations and some other scheduling restrictions that prevent Core 2 processors from always executing operations at maximum speed.