Since the day Intel announced its Pentium 4 processor, many questions have arisen about the strange results this processor demonstrated in a number of tasks. Although Pentium 4 processors boasted higher clock frequencies and specific architectural features, such as the Trace Cache, Rapid Execution Engine, Quad-Pumped Bus, hardware prefetch and even Hyper-Threading, which were supposed to increase the number of instructions processed per clock, Pentium 4 processors proved unable to outperform their counterparts (Pentium M) as well as their competitors (AMD Athlon) working at lower frequencies. Most reviewers would usually explain these performance issues by the longer pipeline, and sometimes by the smaller cache capacity or higher memory latency. Other reasons were suggested only rarely.
However, none of the factors I have just mentioned can really explain certain anomalies you may come across during testing. As an example, let's consider a situation when we test memory latency with a succession of dependent mov eax, [eax] instructions (the so-called pointer chasing) "with aggravation", when each dependent load is followed by a succession of N dependent ADD operations: X * { mov eax,[eax] - N*{add eax, 0} }.
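To make the construction of this kernel concrete, here is a minimal sketch in C with GCC-style inline assembly. The buffer size, stride, iteration count and the fixed N = 4 are illustrative assumptions only, not the exact parameters of our tests:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>   /* __rdtsc() */

    #define CHAIN_BYTES (256 * 1024)  /* bigger than Northwood's 8 KB L1, smaller than its 512 KB L2 */
    #define STRIDE      64            /* one cache line between chain nodes */
    #define ITERS       1000000

    int main(void)
    {
        char  *buf   = malloc(CHAIN_BYTES);
        size_t nodes = CHAIN_BYTES / STRIDE;

        /* Build a circular pointer chain: each node stores the address of the
           next one, so every load depends on the result of the previous load.
           (A real latency test would also randomize the walk order to defeat
           the hardware prefetcher; omitted here for brevity.) */
        for (size_t i = 0; i < nodes; i++)
            *(void **)(buf + i * STRIDE) = buf + ((i + 1) % nodes) * STRIDE;

        void    *p     = buf;
        uint64_t start = __rdtsc();

        for (long i = 0; i < ITERS; i++) {
            /* One iteration of the kernel described above:
               mov eax,[eax] followed by N dependent "add eax, 0" instructions.
               N is fixed at 4 here; the real test varies N. */
            __asm__ volatile(
                "mov (%0), %0 \n\t"
                "add $0, %0   \n\t"
                "add $0, %0   \n\t"
                "add $0, %0   \n\t"
                "add $0, %0   \n\t"
                : "+r"(p)
                :
                : "memory");
        }

        uint64_t clocks = __rdtsc() - start;
        printf("clocks per iteration: %.2f\n", (double)clocks / ITERS);

        free(buf);
        return 0;
    }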
If we know how long the addition takes, we can determine time T for the load operation as the time required for a single iteration minus the time required for a succession of N additions. If everything were that simple, the T(N) dependence graph would be a horizontal line located at the ideal L2 cache access time, i.e. 9 = 2 + 7 clocks (the 2-clock L1 access plus 7 additional clocks for L2). In reality the graph looks as follows, and its shape and behavior are simply impossible to explain with the documentation and information Intel's optimization guides offer us:
Pic. 1: Pentium 4 (Northwood) L2 cache latency testing
with a succession X*{mov eax,[eax] - N*{add eax, 0}}.
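For clarity, the extraction of T(N) plotted above amounts to the simple subtraction described earlier; t_iter_n and t_add below are hypothetical names for the measured per-iteration time and the separately calibrated ADD latency:

    /* T(N) = T_iter(N) - N * T_add: subtract the time spent in the N dependent
       ADDs from the measured time of one iteration; what remains is attributed
       to the dependent load. */
    double load_latency(double t_iter_n, int n, double t_add)
    {
        return t_iter_n - (double)n * t_add;
    }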
Luckily, there is at least one hint in the optimization guides: a very brief and superficial description of a mechanism called replay. Here is a quote:
Replay
In order to maximize performance for the common case, the Intel NetBurst micro-architecture sometimes aggressively schedules µops for execution before all the conditions for correct execution are guaranteed to be satisfied. In the event that all of these conditions are not satisfied, µops must be reissued. This mechanism is called replay.
Some occurrences of replays are caused by cache misses, dependence violations (for example, store forwarding problems), and unforeseen resource constraints. In normal operation, some number of replays are common and unavoidable. An excessive number of replays indicate that there is a performance problem.
This scanty explanation suggests that replay may cause serious problems when a cache miss occurs. In fact, after reading this description it occurred to us that replay could explain the shape of the L2 cache latency graph. Our search for additional information in official documents and articles came to nothing: all the data we could dig out comes from patents.
So, the article you are about to read is the result of our detailed study of the following Intel patents:
- Patent 6,163,838 "Computer processor with a replay system"
- Patent 6,094,717 "Computer processor with a replay system having a plurality of checkers"
- Patent 6,385,715 "Multi-threading for a processor utilizing a replay queue"
We also carried out and analyzed a whole set of benchmarks, paying most attention to the Northwood processor core. As for a detailed study of the Prescott processor core, we are still working on it, as it requires a lot of time and resources.