Please check out the first article from the trilogy called Prescott: The Last of the Mohicans? (Pentium 4: from Willamette to Prescott) here.
Chapter VIII: How We Heard about Replay
No matter how exciting the mysteries of Pentium 4 performance are, the article cannot go on forever. But it is high time we described something we discovered during our investigation of the Pentium 4 micro-architecture. In fact, this discovery is the very reason we decided to write this article. It turned out to be quite mysterious, both from the structural point of view and as far as the official documentation is concerned.
This is how the whole thing happened.
A while after we started working on this article, when we had a draft of the first seven chapters, it seemed that the major traits of the Pentium 4 micro-architecture were already well known to us. We were very happy that the article was almost finished and put all our effort into polishing the small details: checking cache latencies and comparing the results with what the documents stated. Although we didn't question the data in the official papers, we had our own measurement techniques, developed while working on the previous article about the Athlon 64/Opteron micro-architecture, so we really wanted a fair comparison. All the more so because, while writing the article called "AMD: Per Aspera Ad Astra", we had noticed that the Pentium 4 processor behaved rather strangely: the results of the L2 cache latency measurements didn't make any sense. We had to find an explanation for this phenomenon.
The Northwood-based Pentium 4 processor was tested with a chain of dependent instructions such as mov eax, [eax] (so-called pointer chasing). Theoretically, everything was supposed to be predictable here: according to the documentation, the L1 cache latency is 2 clocks and the L2 cache latency is 7 clocks. In other words, we expected to get a total latency of 9 clocks.
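For readers who want to try a similar measurement, here is a minimal pointer-chasing sketch in C. It is not the test program used for this article, which is not shown here; the function names build_chain, chase and ns_per_load are hypothetical, and timing with the standard clock() call is an assumption chosen for portability. The array is linked into a single random cycle, so every load's address depends on the result of the previous load, just as with mov eax, [eax].

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Keeps the result of the last chase alive so the loop is not optimized away. */
void *volatile chase_sink;

/* Link n slots into one random cycle (Sattolo's algorithm): slot i holds
   the address of slot perm[i]. Returns 0 on success, -1 if out of memory. */
int build_chain(void **slots, size_t n, unsigned seed)
{
    assert(n > 0);
    size_t *perm = malloc(n * sizeof *perm);
    if (perm == NULL)
        return -1;
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i): guarantees a single cycle */
        size_t t = perm[i];
        perm[i] = perm[j];
        perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        slots[i] = &slots[perm[i]];
    free(perm);
    return 0;
}

/* Follow the chain for the given number of dependent loads. */
void *chase(void *start, size_t steps)
{
    void *p = start;
    while (steps--)
        p = *(void *volatile *)p;           /* the next address comes from this load */
    return p;
}

/* Average time per dependent load, in nanoseconds of CPU time. */
double ns_per_load(void **slots, size_t steps)
{
    assert(steps > 0);
    clock_t t0 = clock();
    chase_sink = chase(slots, steps);
    clock_t t1 = clock();
    return (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)steps;
}
```

To use it, allocate an array of n pointers, call build_chain(slots, n, seed) once, and then call ns_per_load(slots, steps) with a large step count. The size of the array decides which cache level the loads hit, and multiplying the result by the core frequency in GHz gives an approximate latency in clocks. The loop overhead is included in the figure, so treat it as an estimate rather than an exact latency.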
Our reaction to the actual results can best be described with the phrase "struck dumb". The problem arose where no one would have expected it. Instead of the firmly stated latency value from the white paper, we were facing something unbelievable that looked more like the cardiogram of a heart patient.
First of all, we never saw anything close to the expected (and claimed by the documentation!) 9 clocks. The CPU somehow managed to produce latencies of tens of clocks instead.
It seemed that we had to take the obvious step to find out where this whole show was coming from: check the white paper again. But it was not that simple. There was no mention of any phenomenon like the one we were witnessing. The optimization guides didn't contain any answers to our numerous questions either.
Moreover, this behavior of our CPU made us ask ourselves: did we really study all the principal subsystems of the Pentium 4 micro-architecture carefully enough? Maybe there is some important subsystem that hasn't been described yet? Does the effect we have just discovered affect processor performance in any way? And if it does, how big is its influence? Will we see the same thing in any real applications at all?