Micro-Architecture Improvements
Intel engineers didn't just add support for new SIMD instructions to their new processors; they also reworked some of the functional units. As a result, they managed to significantly speed up integer and floating-point division and to accelerate those SSE instructions that shuffle data within registers.
Fast division is performed by a dedicated Penryn unit called the Fast Radix-16 Divider. While the Radix-4 unit of the 65nm processors on Core micro-architecture could calculate only 2 quotient bits in a single pass of the iterative algorithm, the new unit handles 4 bits per pass. As a result, Penryn processors can perform integer and floating-point division about twice as fast, and they calculate square roots faster as well.
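To see why doubling the number of quotient bits per pass roughly halves the work, consider a simple digit-recurrence division model. This is only an illustrative sketch: the function name `divide_radix` and its parameters are hypothetical, it handles unsigned integers only, and it picks each quotient digit by plain comparison, whereas a real hardware divider uses much more elaborate digit selection logic.

```python
def divide_radix(dividend, divisor, bits_per_step, width=32):
    """Unsigned digit-recurrence division (illustrative model, not Intel's design).

    Each iteration retires `bits_per_step` quotient bits:
    2 bits models a radix-4 divider, 4 bits models a radix-16 divider.
    Returns (quotient, remainder, iterations).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit into the given width")

    digit_mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    iterations = 0

    # Walk over the dividend from the most significant digit down.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        remainder = (remainder << bits_per_step) | ((dividend >> shift) & digit_mask)

        # Select the largest digit whose multiple of the divisor still fits.
        digit = 0
        while (digit + 1) * divisor <= remainder:
            digit += 1

        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        iterations += 1

    return quotient, remainder, iterations


if __name__ == "__main__":
    for name, bits in (("radix-4", 2), ("radix-16", 4)):
        q, r, n = divide_radix(1000000007, 12345, bits)
        print(name, "quotient:", q, "remainder:", r, "iterations:", n)
```

For a 32-bit dividend the radix-4 variant needs 16 iterations and the radix-16 variant needs 8, which matches the "about twice as fast" estimate as long as each iteration takes a similar amount of time.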
Intel engineers had to modify the way SSE shuffle operations are executed in order to implement the new SSE4 instructions properly. The new single-pass 128-bit shuffle unit, called the Super Shuffle Engine, can rearrange the contents of a 128-bit register in a single clock cycle. As a result, the new processors can execute SSE instructions that involve shuffling twice as fast. These instructions include operand packing and unpacking, wide shifts, aligning of concatenated sources, insertion and extraction.
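For readers unfamiliar with these operations, here is a small model of what two of them do to register contents: an unpack (in the style of PUNPCKLBW) and an "align concatenated sources" operation (in the style of PALIGNR). It is only a sketch of the instruction semantics at byte granularity, with a register represented as a list of 16 byte values whose index 0 is the least significant byte. The Python function names are ours, and the code says nothing about how the shuffle unit is built internally.

```python
def punpcklbw(dst, src):
    """Interleave the low 8 bytes of dst and src (dst byte first)."""
    out = []
    for i in range(8):
        out.append(dst[i])
        out.append(src[i])
    return out


def palignr(dst, src, imm):
    """Concatenate dst:src (src in the low half), shift right by imm bytes,
    and keep the low 16 bytes. Bytes shifted in from the top are zero."""
    concat = list(src) + list(dst)      # 32 bytes, src occupies the low half
    shifted = concat[imm:imm + 16]
    return shifted + [0] * (16 - len(shifted))


if __name__ == "__main__":
    a = list(range(0, 16))    # bytes 0..15
    b = list(range(16, 32))   # bytes 16..31
    print(punpcklbw(a, b))
    print(palignr(b, a, 4))
```

In both cases every output byte can come from almost any input position, which is why a unit that handles the full 128 bits in one pass benefits this whole group of instructions.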
Besides speeding up the execution of certain instructions, Intel engineers also improved the virtualization technology and the interrupt masking mechanisms. As a result, Penryn processors can boast two further advantages: first, they switch between virtual machines about 25%-75% faster, and second, they process CLI/STI instructions much faster, too.