Closer Look at AMD Phenom Processor
General Information
Whatever the outcome of today’s test session, we cannot deny that AMD beat Intel to the launch of a true quad-core processor. While Core 2 Quad processors manufactured with 65nm and 45nm processes are nothing more than two dual-core processors combined in a single package, Phenom is a fully-fledged quad-core solution: a single processor die (currently manufactured with a 65nm process) contains all four cores.
This approach to multi-core processor design allowed AMD engineers to implement functional units that are shared between all four cores. The new AMD Phenom has one of each of the following: a controller for the HyperTransport bus, which transfers data between the CPU and the chipset, a DDR2 memory controller, and an L3 cache. AMD already used a shared HyperTransport controller and memory controller in their dual-core Athlon 64 X2 processors, so we were not surprised to see them in the new AMD Phenom, too.
The shared L3 cache, however, is being used for the first time. The current Phenom models have a 2MB L3 cache. The bandwidth of this cache is not very high relative to the performance of the memory subsystem, but it boasts pretty low latency. Moreover, it significantly speeds up data transfers between the processor cores without loading the memory bus, which is actually its major purpose in the new processors.
However, putting four processor cores onto a single die also has a downside. These semiconductor dies, even when manufactured with a contemporary 65nm process, come out pretty big. This, of course, leads to a significant drop in chip yields and an increase in production costs. AMD, however, seems to have found a way to put most of the defective dies to good use. Next year they will start supplying triple-core and maybe even dual-core processors made from original Phenom dies with one or two failed cores.
Another issue resulting from the large die size of the new processors is their relatively low clock frequencies, because AMD has to keep the growing heat dissipation of the CPU in check. While quad-core Intel processors manufactured with a 65nm process are currently running at up to 3.0GHz, AMD will hardly be able to introduce a Phenom clocked beyond 2.6GHz in the near future. Moreover, the currently announced models work only at 2.2GHz-2.3GHz. It looks like AMD will be able to resolve this issue only in H2 2008, when the company is planning to switch production of its quad-core processors to the more advanced 45nm manufacturing technology.
So, it’s time to compare the basic specifications of the new Phenom processor against those of its main quad-core opponent, Core 2 Quad from Intel. The table below shows two Intel Core 2 Quad models: the older one codenamed Kentsfield and the newer one codenamed Yorkfield, which is manufactured with a 45nm process and scheduled to become widely available in early 2008.
New Phenom processors are interesting not only because they have four processor cores on a single die. AMD engineers also introduced a number of improvements into the micro-architecture itself, making the individual cores work faster than those of Athlon 64 processors. And although the core micro-architecture in Phenom processors doesn’t differ too much from the classical K8 micro-architecture, AMD used a new codename for it: K10, which was later replaced with the more poetic “Stars Microarchitecture”.
We have a separate article on our site devoted to all the details of the revised micro-architecture. So here we are going to briefly list all the innovations made in the AMD Phenom processors:
- Wider data paths between the execution units and the L1 cache, and between the L1 and L2 caches. The bus between the L1 and L2 caches of the new Phenom processors was widened to 128 bits in each direction, and the CPU can now perform two 128-bit data loads from the L1 cache per clock cycle.
- Advanced memory prefetcher. Phenom processors can now deliver prefetched data directly to the L1 cache without loading it into the L2 cache first, which would inevitably increase latency. Moreover, the prefetcher recognizes repeated "stride patterns" and uses them to predict which data to prefetch.
- 32-byte instruction fetching. Code is loaded into the Phenom decoder in 32-byte blocks instead of 16-byte ones. This reduces the idle time of the processor execution units.
- Improved branch prediction. Processors with the revised micro-architecture now handle indirect branches correctly. This significantly improves the probability of correct branch prediction in programs created with object-oriented languages and contemporary compilers.
- Speculative out-of-order data loading. Like CPUs based on the Core micro-architecture, Phenom can speculatively execute data loads ahead of other operations that may alter this data.
- New Sideband Stack Optimizer algorithm. It lowers the overhead of stack operations by tracking the status of the ESP register independently.
- Implementation of 128-bit floating-point units (compared with the 64-bit floating-point units in Athlon 64 processors). As a result, each Phenom core can process up to 4 double-precision FPU instructions per clock, and most 128-bit SSE operations can be processed within a single clock. Moreover, the new processors support new SSE instructions from the SSE4A set. However, SSE4A is not compatible with the SSE4.1 set supported by new Intel processors.
- Improved virtualization technology that speeds up applications running on virtual machines.
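
To put a couple of the figures above into perspective, here is a rough back-of-the-envelope calculation of the theoretical peaks they imply. It is only a sketch under our own assumptions: we read "4 FPU instructions with double-precision per clock" as four double-precision floating-point operations per core per clock, we take the 2.2GHz and 2.3GHz clock speeds of the announced models, and the function and constant names are ours, chosen purely for illustration. These are ceilings on paper, not measured results.

```python
# Theoretical peak figures derived from numbers quoted in the article.
# Assumption: "4 FPU instructions per clock" means 4 double-precision
# floating-point operations per core per clock.

CORES = 4                  # Phenom puts four cores on one die
DP_OPS_PER_CLOCK = 4       # per core, thanks to the 128-bit FPU
L1_LOADS_PER_CLOCK = 2     # 128-bit loads from L1 per clock, per core
LOAD_WIDTH_BITS = 128


def peak_dp_gflops(clock_ghz: float, cores: int = CORES) -> float:
    """Theoretical double-precision GFLOPS for the given number of cores."""
    return clock_ghz * DP_OPS_PER_CLOCK * cores


def peak_l1_load_gb_per_s(clock_ghz: float) -> float:
    """Theoretical L1 load bandwidth of one core in GB/s (10^9 bytes)."""
    bytes_per_clock = L1_LOADS_PER_CLOCK * LOAD_WIDTH_BITS // 8  # 32 bytes
    return clock_ghz * bytes_per_clock


if __name__ == "__main__":
    for ghz in (2.2, 2.3):
        print(f"{ghz}GHz: {peak_dp_gflops(ghz):.1f} DP GFLOPS per chip, "
              f"{peak_l1_load_gb_per_s(ghz):.1f} GB/s of L1 loads per core")
```

Under these assumptions, the 2.3GHz model tops out at about 36.8 double-precision GFLOPS for the whole chip and about 73.6GB/s of L1 load bandwidth per core. Real applications will land well below both numbers.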