Floating Point Unit
The floating point unit (FPU) scheduler of the K8 and K10 processors is separate from the integer unit scheduler and is designed somewhat differently. The scheduler buffer can hold up to 12 groups of 3 MOPs each (in theory, 36 floating-point operations). Unlike the integer unit with its symmetrical pipes, the FPU consists of three different units: FADD for floating-point addition, FMUL for floating-point multiplication and FMISC (also known as FSTORE) for stores to memory and auxiliary operations. Therefore, the scheduler buffer doesn’t bind each MOP in a group to a particular unit (Pic.4):
Pic.4: Floating Point Unit
Each clock cycle K8 and K10 can issue one operation to each floating-point unit for execution. The K8 processor features 80-bit floating-point units. At the decoding stage, 128-bit vector SSE instructions are split into two MOPs that each process a 64-bit half of the 128-bit operand and are executed successively in different clocks. This not only slows down vector instruction processing, but also halves the effective size of the FPU scheduler buffer and so reduces the depth of out-of-order execution.
In the K10 processor the floating-point units were widened to 128 bits. K10 processes 128-bit vector operands in a single operation, which doubles the theoretical processing speed for vector SSE instructions compared with K8. Moreover, since there are now half as many MOPs, the effective length of the scheduler queue increases, which allows for deeper out-of-order execution.
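The scheduler arithmetic above can be sketched in a few lines of Python. This is only an illustrative model: the function and constant names are hypothetical, and it assumes the buffer holds nothing but 128-bit SSE instructions. The 12 x 3 buffer size and the MOPs-per-instruction figures are taken from the text.

```python
# Illustrative model only: names are hypothetical, figures come from the text.
SCHEDULER_GROUPS = 12   # groups held by the FPU scheduler buffer
MOPS_PER_GROUP = 3      # MOPs in each group

# MOPs generated per 128-bit vector SSE instruction
MOPS_PER_VECTOR_OP = {"K8": 2, "K10": 1}


def vector_ops_in_flight(cpu: str) -> int:
    """Upper bound on 128-bit SSE instructions the FPU scheduler can hold,
    assuming (as a simplification) that the buffer contains nothing else."""
    total_mops = SCHEDULER_GROUPS * MOPS_PER_GROUP
    return total_mops // MOPS_PER_VECTOR_OP[cpu]


for cpu in ("K8", "K10"):
    print(cpu, vector_ops_in_flight(cpu))
```

Under these assumptions the model gives 18 vector instructions for K8 and 36 for K10, which is the doubling of effective queue length described above.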
The K8 processor executed SSE load instructions using the FSTORE unit. On the one hand, this prevents any other instruction requiring this unit from executing at the same time, and on the other, it limits the number of simultaneously launched load instructions to one. K8 can perform two parallel reads from memory only if one of the instructions combines a memory request with a data operation (the so-called Load-Execute instruction), for example, ADDPS xmm1, [esi].
The K10 processor has an improved mechanism for SSE loads.
Firstly, data load instructions no longer use FPU resources. The FSTORE port is therefore free and available for other instructions. Load instructions can now be executed two per clock.
Secondly, if the data in memory is aligned on a 16-byte boundary, unaligned loads (MOVU**) work as efficiently as aligned loads (MOVA**). So using MOVA** no longer brings any advantage on K10 processors.
Thirdly, K10 processors can now use unaligned loads even in Load-Execute instructions that combine a load with a data operation. If it is unclear whether the data in memory is aligned, the compiler (or programmer) usually uses a MOVU** instruction to read the data into a register for further processing. By using unaligned loads directly in Load-Execute instructions, they can reduce the number of separate load instructions in the program code and hence increase performance. Compilers will need to add support for this feature. The SSE specification developed by Intel requires that a Load-Execute instruction accessing an address not aligned on a 16-byte boundary raise an exception. To remain compatible with the specification, unaligned loads in Load-Execute instructions have to be enabled by a special flag in programs designed and compiled with the new processor features in mind.
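The rule described above can be expressed as a small decision function. This is a hedged sketch of the behaviour as the text describes it, not of any real processor interface: the function names and the `unaligned_allowed` parameter (standing in for the special flag mentioned above) are hypothetical.

```python
SSE_ALIGNMENT = 16  # bytes; a 128-bit operand is aligned on this boundary


def is_aligned(address: int) -> bool:
    """True if the address lies on a 16-byte boundary."""
    return address % SSE_ALIGNMENT == 0


def load_execute_outcome(address: int, unaligned_allowed: bool) -> str:
    """Model of a Load-Execute instruction such as ADDPS xmm1, [esi].

    unaligned_allowed is a hypothetical stand-in for the special flag
    that enables unaligned Load-Execute operands on K10.
    """
    if is_aligned(address) or unaligned_allowed:
        return "ok"
    # Behaviour required by the SSE specification when the flag is not set
    return "exception"


print(load_execute_outcome(0x1000, unaligned_allowed=False))
print(load_execute_outcome(0x1004, unaligned_allowed=False))
print(load_execute_outcome(0x1004, unaligned_allowed=True))
```

With the flag cleared, the model keeps the behaviour the specification requires, so existing code sees no change. Only code that sets the flag gets the relaxed rule.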
Fourthly, the two buses for reading data from the L1 cache of the K10 processor were widened to 128 bits. As a result the CPU can read two 128-bit data blocks each clock. This is a very important architectural feature, because 2 instructions executing in parallel need 4 operands (2 per instruction), and in some streaming data processing algorithms two of the four operands are usually read from memory. By contrast, the two buses for writing data in the K10 processor remained 64 bits wide, and a 128-bit result is split into two 64-bit packets when written to memory. So every clock the CPU can perform only one 128-bit write, or two 128-bit reads, or one 128-bit read and one 64-bit write. However, since the number of reads is usually at least twice the number of writes, the write limitation shouldn’t really affect processor efficiency during 128-bit data processing.
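The per-clock combinations listed above translate into the following L1 traffic figures. This is a minimal sketch: the dictionary and function names are hypothetical, and only the three combinations named in the text are modelled.

```python
# Per-clock L1 access combinations for K10 as listed in the text:
# label -> (number of 128-bit reads, bits written)
K10_L1_COMBINATIONS = {
    "two 128-bit reads": (2, 0),
    "one 128-bit write": (0, 128),
    "one 128-bit read + one 64-bit write": (1, 64),
}


def bytes_per_clock(reads_128: int, write_bits: int) -> int:
    """Total bytes moved between the core and L1 in one clock."""
    return (reads_128 * 128 + write_bits) // 8


for label, (reads, write_bits) in K10_L1_COMBINATIONS.items():
    print(f"{label}: {bytes_per_clock(reads, write_bits)} bytes/clock")
```

The model gives 32 bytes per clock for the read-only case and 16 bytes per clock for the write-only case, which shows the asymmetry between the read and write paths.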
Fifthly, 128-bit register-to-register copies (MOV***) can now be performed in any of the three FPU units and not only in FADD and FMUL. This also frees the FADD and FMUL units for their dedicated operations.
As we see, the FPU of the K10 processor has become much more flexible. It has acquired some unique features that Intel processors don’t have yet, namely efficient unaligned loads, including in Load-Execute instructions, and two 128-bit reads per clock cycle. Unlike Core 2, the floating-point and integer schedulers use separate queues, which eliminates conflicts between operations competing for the same execution ports. However, K10 still shares the FSTORE unit between SSE store operations and some data transformation instructions, which may sometimes affect their processing speed.
All in all, the K10 FPU promises to be quite efficient and more advanced than the FPU of Core 2 (for example, thanks to two 128-bit reads per clock and efficient unaligned loads).