Memory Subsystem
The K8 has a split L1 cache: 64KB for instructions and 64KB for data. Each cache is 2-way set associative with a 64-byte line. The low associativity is a consequence of the core's ability to perform two reads from the cache per clock cycle, and it is compensated by the rather large size (64KB): with unfavorable access patterns the cache will be roughly as effective as a 32KB cache with an associativity of 4, while in algorithms with sequential access a large L1 cache that can serve two reads per cycle looks preferable.
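For illustration, here is a small C sketch (our own simplification, not AMD documentation) of how addresses map to sets in a cache with these parameters. With 64KB, 2 ways and 64-byte lines there are 512 sets, so addresses 32KB apart land in the same set and only two of them can be cached at once; the example addresses are arbitrary.

#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch: set indexing for a 64KB, 2-way cache with
 * 64-byte lines (the K8 L1 data cache parameters quoted above).
 * sets = size / (line * ways) = 65536 / (64 * 2) = 512 */
#define LINE_SIZE   64
#define CACHE_SIZE  (64 * 1024)
#define WAYS        2
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * WAYS))   /* 512 */

static unsigned set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int main(void)
{
    /* Three addresses 32KB apart map to the same set; with only
     * 2 ways, the third access evicts one of the first two lines. */
    uintptr_t a = 0x100000, b = a + 32 * 1024, c = b + 32 * 1024;
    printf("set(a)=%u set(b)=%u set(c)=%u of %d sets\n",
           set_index(a), set_index(b), set_index(c), NUM_SETS);
    return 0;
}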
The L2 cache is exclusive with respect to the L1: data are not duplicated between the two levels. The L1 and L2 caches exchange data across two unidirectional buses (one from the L1 to the L2 and one from the L2 to the L1), each 64 bits or 8 bytes wide (Figure 6). With this organization the processor receives data from the L2 cache at a rather slow rate of 8 bytes per clock (8 clocks to transfer a 64-byte line), so the data transfer latency is high, especially when two or more lines in the L2 cache are being accessed simultaneously. This is somewhat compensated by the increased number of cache hits due to the high associativity of the L2 cache (16 ways) and the larger total amount of cache memory (thanks to the exclusive design).
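To show why simultaneous line requests hurt, the sketch below works out (under our simplifying assumption that the requests fully serialize on the single 8-byte bus) when the last of several pending lines arrives.

#include <stdio.h>

/* Rough sketch of the figures above: the K8's 64-bit (8-byte) L2->L1
 * bus moves a 64-byte line in 8 clocks, and additional lines requested
 * at the same time queue behind it on the same bus. */
int main(void)
{
    const int line_bytes = 64;
    const int bus_bytes_per_clock = 8;
    const int clocks_per_line = line_bytes / bus_bytes_per_clock;   /* 8 */

    for (int lines = 1; lines <= 3; lines++)
        printf("%d line(s) in flight: last one arrives after %d clocks\n",
               lines, lines * clocks_per_line);
    return 0;
}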
Fig. 6
The memory controller resides on the same die as the CPU core. Data from the memory controller flow through the crossbar directly into the L1 cache, bypassing the L2, which reduces the latency of transfers from system RAM. The L2 cache only receives data that have been evicted from the L1.
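The following toy model (our own illustration, not AMD's actual design) captures the exclusive fill path just described: a miss is serviced straight from memory into the L1, the L2 only ever stores L1 victims, and an L2 hit swaps the line back into the L1.

#include <stdio.h>

/* Toy model of an exclusive L1/L2 pair, each reduced to a single
 * line holding a block number, to illustrate the fill path above. */
enum { EMPTY = -1 };

static int l1_line = EMPTY;   /* one-line "L1" */
static int l2_line = EMPTY;   /* one-line "L2" acting as a victim buffer */

static void access_block(int block)
{
    if (l1_line == block) {                 /* L1 hit */
        printf("block %d: L1 hit\n", block);
    } else if (l2_line == block) {          /* L2 hit: swap L1 <-> L2 */
        int victim = l1_line;
        l1_line = block;
        l2_line = victim;                   /* exclusivity preserved */
        printf("block %d: L2 hit, swapped with L1 victim\n", block);
    } else {                                /* miss: memory -> L1 */
        if (l1_line != EMPTY)
            l2_line = l1_line;              /* evicted line goes to L2 */
        l1_line = block;                    /* L2 is bypassed on the fill */
        printf("block %d: filled from memory into L1\n", block);
    }
}

int main(void)
{
    access_block(1);   /* miss: memory -> L1             */
    access_block(2);   /* miss: 1 evicted to L2, 2 -> L1 */
    access_block(1);   /* L2 hit: 1 and 2 swap           */
    return 0;
}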
As was mentioned above, the K8L processor’s L1 cache can provide up to two 128-bit reads or one read and one write per clock. Unfortunately, there is no information that the width or design of the bus connecting the L1 and L2 caches will change. We hope, however, that it will at least be doubled in width. Otherwise, the slow inter-cache bus is going to limit CPU performance on code with streaming floating-point instructions: the powerful computational resources of the CPU will sit idle, waiting for data to arrive from the L2 at a slow rate. The associativity of the L1 cache won’t be increased either, so it looks like we can’t hope for any performance miracles from the new processor in this respect.
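To put rough numbers on this concern: the sketch below compares what the L1 can consume per clock with what the L2-to-L1 bus can deliver, assuming (our assumption, since nothing else has been announced) that the bus stays 64 bits wide as in the K8.

#include <stdio.h>

/* Bandwidth-mismatch sketch; the 64-bit L2->L1 bus width for the K8L
 * is an assumption (unchanged from the K8). */
int main(void)
{
    const int l1_read_bytes_per_clock = 2 * 16;  /* two 128-bit reads */
    const int l2_bus_bytes_per_clock  = 8;       /* 64-bit bus        */

    printf("L1 read demand: %d B/clock\n", l1_read_bytes_per_clock);
    printf("L2->L1 supply : %d B/clock\n", l2_bus_bytes_per_clock);
    printf("streaming code working out of the L2 runs at ~1/%d of the L1 read rate\n",
           l1_read_bytes_per_clock / l2_bus_bytes_per_clock);
    return 0;
}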
The K8L will have an L3 cache shared among up to four cores. This cache will reside on the same die as the cores and will have a capacity of 2 or more megabytes. The L3 cache will most likely be exclusive, like the L2 cache. Combined with an enhanced crossbar, the L3 cache will solve the problem of the low speed of transfers of modified data between the caches of neighboring cores, which in the K8 are performed via the memory bus (for details see our article Investigation: Data Transfer Rate between the Cores in Dual-Core Processors). This problem is largely solved in the Conroe by sharing the L2 cache between the two cores, so the quad-core K8L will most likely be close to the Conroe in its inter-core data transfer characteristics. Curiously, it is now reported that the future quad-core Intel processor won’t have its cache shared among all four cores, so the same problem we see now in a dual-core processor without a shared cache (an insufficiently high speed of the data transfers needed to maintain cache coherency) may appear in it, too.
Conroe-core processors have a well-developed caching system: a 32KB L1 cache with an associativity of 8 and a 2-4MB L2 cache with an associativity of 16, linked by a full-speed 256-bit bus. The processor has highly efficient prefetch units that can aggressively load data not only from system RAM but also from the L2 cache into the L1 cache. It is this enhanced caching system that is largely responsible for the Conroe’s greatly improved SPEC INT results (by 40% in some subtests) in comparison with the previous Yonah core.
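Applying the same back-of-the-envelope arithmetic as above to the Conroe’s full-speed 256-bit L2 bus shows the gap in line transfer time (again a sketch based only on the bus widths quoted in the text):

#include <stdio.h>

/* Line transfer time over the L2->L1 path: Conroe's 256-bit (32-byte)
 * full-speed bus versus the K8's 64-bit (8-byte) bus. */
int main(void)
{
    const int line_bytes = 64;
    const int conroe_bus = 32;   /* 256-bit bus, bytes per clock */
    const int k8_bus     = 8;    /* 64-bit bus,  bytes per clock */

    printf("Conroe: %d clocks per 64-byte line\n", line_bytes / conroe_bus);
    printf("K8    : %d clocks per 64-byte line\n", line_bytes / k8_bus);
    return 0;
}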
To all appearances, due to the peculiarities of its caching system the K8L won’t prefetch data from the L2 cache into the L1 cache and won’t get high-efficiency prefetch units. If these deficiencies of the K8 core’s caching subsystem are not eliminated in the K8L, the new processor won’t be able to show its full potential and will be less efficient than the Conroe.
Besides the innovations mentioned, the K8L’s memory subsystem is going to be modernized in other ways, such as support for upcoming DDR3 SDRAM and FB-DIMM memory and for HyperTransport 3. But these improvements will hardly have a great impact on the performance of desktop computers.