C.3.2 Store Forwarding

To see the first complication, a violation of self-consistency, consider the following code with variables ``a'' and ``b'' both initially zero, and with the cache line containing variable ``a'' initially owned by CPU 1 and that containing ``b'' initially owned by CPU 0:



  a = 1;
  b = a + 1;
  assert(b == 2);


One would not expect the assertion to fail. However, if one were foolish enough to use the very simple architecture shown in Figure [*], one would be surprised. Such a system could potentially see the following sequence of events:

  1. CPU 0 starts executing a = 1.
  2. CPU 0 looks ``a'' up in the cache, and finds that it is missing.
  3. CPU 0 therefore sends a ``read invalidate'' message in order to get exclusive ownership of the cache line containing ``a''.
  4. CPU 0 records the store to ``a'' in its store buffer.
  5. CPU 1 receives the ``read invalidate'' message, and responds by transmitting the cache line and removing that cache line from its cache.
  6. CPU 0 starts executing b = a + 1.
  7. CPU 0 receives the cache line from CPU 1, which still has a value of zero for ``a''.
  8. CPU 0 loads ``a'' from its cache, finding the value zero.
  9. CPU 0 applies the entry from its store buffer to the newly arrived cache line, setting the value of ``a'' in its cache to one.
  10. CPU 0 adds one to the value zero loaded for ``a'' above, and stores it into the cache line containing ``b'' (which we will assume is already owned by CPU 0).
  11. CPU 0 executes assert(b == 2), which fails.
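
The sequence above can be replayed in a few lines of C. This is only a toy sketch, not a description of any real hardware: struct toy_cpu and run_without_forwarding() are hypothetical names introduced here, and the model assumes a one-entry store buffer and tracks only the variable ``a''.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of CPU 0: the cache line for "a" plus a
 * one-entry store buffer.  Not a model of any real CPU. */
struct toy_cpu {
	bool line_present;	/* Is the cache line holding "a" in the cache? */
	int cached_a;		/* Value of "a" in that cache line. */
	bool sb_valid;		/* Is a store to "a" pending in the store buffer? */
	int sb_a;		/* Value of the pending store. */
};

/* Replay the sequence of events without store forwarding and
 * return the final value of "b". */
int run_without_forwarding(void)
{
	struct toy_cpu cpu0 = { .line_present = false, .sb_valid = false };
	int loaded_a;
	int b;

	/* Items 1-4: a = 1 misses the cache, so the store is parked
	 * in the store buffer. */
	cpu0.sb_valid = true;
	cpu0.sb_a = 1;

	/* Items 5-7: the cache line arrives from CPU 1, still holding zero. */
	cpu0.line_present = true;
	cpu0.cached_a = 0;

	/* Item 8: the load consults only the cache, ignoring the store buffer. */
	assert(cpu0.line_present);
	loaded_a = cpu0.cached_a;

	/* Item 9: the store buffer drains into the cache, too late for the load. */
	assert(cpu0.sb_valid);
	cpu0.cached_a = cpu0.sb_a;
	cpu0.sb_valid = false;

	/* Item 10: b = a + 1, using the stale value loaded above. */
	b = loaded_a + 1;
	return b;
}
```

In this model, run_without_forwarding() returns 1 rather than 2, which is exactly the value that makes the assertion in item 11 fail.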

The problem is that we have two copies of ``a'', one in the cache and the other in the store buffer.

This example breaks a very important guarantee, namely that each CPU will always see its own operations as if they happened in program order. Breaking this guarantee is violently counter-intuitive to software types, so much so that the hardware guys took pity and implemented ``store forwarding'', where each CPU refers to (or ``snoops'') its store buffer as well as its cache when performing loads, as shown in Figure [*]. In other words, a given CPU's stores are directly forwarded to its subsequent loads, without having to pass through the cache.

Figure: Caches With Store Forwarding

With store forwarding in place, item [*] in the above sequence (CPU 0's load of ``a'') would have found the correct value of 1 for ``a'' in the store buffer, so that the final value of ``b'' would have been 2, as one would hope.
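
The effect of store forwarding can be sketched in the same toy style. Again, this is an illustration under stated assumptions rather than a description of real hardware: struct fwd_cpu, load_a_forwarded(), and run_with_forwarding() are hypothetical names, and the model assumes a one-entry store buffer that holds at most one pending store to ``a''.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of CPU 0 with store forwarding: the cache
 * line for "a" plus a one-entry store buffer. */
struct fwd_cpu {
	bool line_present;	/* Is the cache line holding "a" in the cache? */
	int cached_a;		/* Value of "a" in that cache line. */
	bool sb_valid;		/* Is a store to "a" pending in the store buffer? */
	int sb_a;		/* Value of the pending store. */
};

/* Load "a", checking the store buffer before the cache, so that
 * the CPU sees its own most recent store. */
int load_a_forwarded(const struct fwd_cpu *cpu)
{
	if (cpu->sb_valid)
		return cpu->sb_a;	/* Forwarded from the store buffer. */
	assert(cpu->line_present);
	return cpu->cached_a;		/* Otherwise, fall back to the cache. */
}

/* Replay the same sequence of events with store forwarding and
 * return the final value of "b". */
int run_with_forwarding(void)
{
	struct fwd_cpu cpu0 = { .line_present = false, .sb_valid = false };
	int loaded_a;
	int b;

	/* a = 1 misses the cache, so the store is parked in the store buffer. */
	cpu0.sb_valid = true;
	cpu0.sb_a = 1;

	/* The cache line arrives from CPU 1, still holding zero. */
	cpu0.line_present = true;
	cpu0.cached_a = 0;

	/* The load now finds the pending store before looking at the cache. */
	loaded_a = load_a_forwarded(&cpu0);

	/* The store buffer drains into the cache. */
	cpu0.cached_a = cpu0.sb_a;
	cpu0.sb_valid = false;

	/* b = a + 1, using the forwarded value. */
	b = loaded_a + 1;
	return b;
}
```

Here run_with_forwarding() returns 2, so the assertion would succeed even though the cache line still held zero when the load executed.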

Paul E. McKenney 2011-12-16