C.3.3 Store Buffers and Memory Barriers

To see the second complication, a violation of global memory ordering, consider the following code sequences with variables ``a'' and ``b'' initially zero:



  1 void foo(void)
  2 {
  3   a = 1;
  4   b = 1;
  5 }
  6
  7 void bar(void)
  8 {
  9   while (b == 0) continue;
 10   assert(a == 1);
 11 }


Suppose CPU 0 executes foo() and CPU 1 executes bar(). Suppose further that the cache line containing ``a'' resides only in CPU 1's cache, and that the cache line containing ``b'' is owned by CPU 0. Then the sequence of operations might be as follows:

  1. CPU 0 executes a = 1. The cache line is not in CPU 0's cache, so CPU 0 places the new value of ``a'' in its store buffer and transmits a ``read invalidate'' message.
  2. CPU 1 executes while (b == 0) continue, but the cache line containing ``b'' is not in its cache. It therefore transmits a ``read'' message.
  3. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the ``modified'' or the ``exclusive'' state), so it stores the new value of ``b'' in its cache line.
  4. CPU 0 receives the ``read'' message, and transmits the cache line containing the now-updated value of ``b'' to CPU 1, also marking the line as ``shared'' in its own cache.
  5. CPU 1 receives the cache line containing ``b'' and installs it in its cache.
  6. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of ``b'' is 1, it proceeds to the next statement.
  7. CPU 1 executes the assert(a == 1), and, since CPU 1 is working with the old value of ``a'', this assertion fails.
  8. CPU 1 receives the ``read invalidate'' message, and transmits the cache line containing ``a'' to CPU 0 and invalidates this cache line from its own cache. But it is too late.
  9. CPU 0 receives the cache line containing ``a'' and applies the buffered store just in time to fall victim to CPU 1's failed assertion.

Quick Quiz C.6: In step 1 above, why does CPU 0 need to issue a ``read invalidate'' rather than a simple ``invalidate''? End Quick Quiz

The hardware designers cannot help directly here, since the CPUs have no idea which variables are related, let alone how they might be related. Therefore, the hardware designers provide memory-barrier instructions to allow the software to tell the CPU about such relations. The program fragment must be updated to contain the memory barrier:



  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_mb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   assert(a == 1);
 12 }


The memory barrier smp_mb() will cause the CPU to flush its store buffer before applying each subsequent store to its variable's cache line. The CPU could either simply stall until the store buffer was empty before proceeding, or it could use the store buffer to hold subsequent stores until all of the prior entries in the store buffer had been applied.

With this latter approach the sequence of operations might be as follows:

  1. CPU 0 executes a = 1. The cache line is not in CPU 0's cache, so CPU 0 places the new value of ``a'' in its store buffer and transmits a ``read invalidate'' message.
  2. CPU 1 executes while (b == 0) continue, but the cache line containing ``b'' is not in its cache. It therefore transmits a ``read'' message.
  3. CPU 0 executes smp_mb(), and marks all current store-buffer entries (namely, the a = 1).
  4. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the ``modified'' or the ``exclusive'' state), but there is a marked entry in the store buffer. Therefore, rather than store the new value of ``b'' in the cache line, it instead places it in the store buffer (but in an unmarked entry).
  5. CPU 0 receives the ``read'' message, and transmits the cache line containing the original value of ``b'' to CPU 1. It also marks its own copy of this cache line as ``shared''.
  6. CPU 1 receives the cache line containing ``b'' and installs it in its cache.
  7. CPU 1 can now load the value of ``b'', but since it finds that the value of ``b'' is still 0, it repeats the while statement. The new value of ``b'' is safely hidden in CPU 0's store buffer.
  8. CPU 1 receives the ``read invalidate'' message, and transmits the cache line containing ``a'' to CPU 0 and invalidates this cache line from its own cache.
  9. CPU 0 receives the cache line containing ``a'' and applies the buffered store, placing this line into the ``modified'' state.
  10. Since the store to ``a'' was the only entry in the store buffer that was marked by the smp_mb(), CPU 0 can also store the new value of ``b'' -- except for the fact that the cache line containing ``b'' is now in ``shared'' state.
  11. CPU 0 therefore sends an ``invalidate'' message to CPU 1.
  12. CPU 1 receives the ``invalidate'' message, invalidates the cache line containing ``b'' from its cache, and sends an ``acknowledgement'' message to CPU 0.
  13. CPU 1 executes while (b == 0) continue, but the cache line containing ``b'' is not in its cache. It therefore transmits a ``read'' message to CPU 0.
  14. CPU 0 receives the ``acknowledgement'' message, and puts the cache line containing ``b'' into the ``exclusive'' state. CPU 0 now stores the new value of ``b'' into the cache line.
  15. CPU 0 receives the ``read'' message, and transmits the cache line containing the new value of ``b'' to CPU 1. It also marks its own copy of this cache line as ``shared''.
  16. CPU 1 receives the cache line containing ``b'' and installs it in its cache.
  17. CPU 1 can now load the value of ``b'', and since it finds that the value of ``b'' is 1, it exits the while loop and proceeds to the next statement.
  18. CPU 1 executes the assert(a == 1), but the cache line containing ``a'' is no longer in its cache. Once it gets this cache from CPU 0, it will be working with the up-to-date value of ``a'', and the assertion therefore passes.

As you can see, this process involves no small amount of bookkeeping. Even something intuitively simple, like ``load the value of a'' can involve lots of complex steps in silicon.

Paul E. McKenney 2011-12-16