C.4.3 Invalidate Queues and Memory Barriers

Let us suppose that CPUs queue invalidation requests, but respond to them immediately. This approach minimizes the cache-invalidation latency seen by CPUs doing stores, but can defeat memory barriers, as seen in the following example.

Suppose the values of ``a'' and ``b'' are initially zero, that ``a'' is replicated read-only (MESI ``shared'' state), and that ``b'' is owned by CPU 0 (MESI ``exclusive'' or ``modified'' state). Then suppose that CPU 0 executes foo() while CPU 1 executes bar() in the following code fragment:



  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_mb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   assert(a == 1);
 12 }


Then the sequence of operations might be as follows:

  1. CPU 0 executes a = 1. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of ``a'' in its store buffer and transmits an ``invalidate'' message in order to flush the corresponding cache line from CPU 1's cache.
  2. CPU 1 executes while (b == 0) continue, but the cache line containing ``b'' is not in its cache. It therefore transmits a ``read'' message.
  3. CPU 1 receives CPU 0's ``invalidate'' message, queues it, and immediately responds to it.
  4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past the smp_mb() on line 4 above, moving the value of ``a'' from its store buffer to its cache line.
  5. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the ``modified'' or the ``exclusive'' state), so it stores the new value of ``b'' in its cache line.
  6. CPU 0 receives the ``read'' message, and transmits the cache line containing the now-updated value of ``b'' to CPU 1, also marking the line as ``shared'' in its own cache.
  7. CPU 1 receives the cache line containing ``b'' and installs it in its cache.
  8. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of ``b'' is 1, it proceeds to the next statement.
  9. CPU 1 executes the assert(a == 1), and, since the old value of ``a'' is still in CPU 1's cache, this assertion fails.
  10. Despite the assertion failure, CPU 1 processes the queued ``invalidate'' message, and (tardily) invalidates the cache line containing ``a'' from its own cache.

Quick Quiz C.7: In step 1 of the first scenario in Section [*], why is an ``invalidate'' sent instead of a ``read invalidate'' message? Doesn't CPU 0 need the values of the other variables that share this cache line with ``a''? End Quick Quiz

There is clearly not much point in accelerating invalidation responses if doing so causes memory barriers to effectively be ignored. Fortunately, the memory-barrier instructions can interact with the invalidate queue: when a given CPU executes a memory barrier, it marks all the entries currently in its invalidate queue, and forces any subsequent load to wait until all marked entries have been applied to the CPU's cache. Therefore, we can add a memory barrier to function bar as follows:



  1 void foo(void)
  2 {
  3   a = 1;
  4   smp_mb();
  5   b = 1;
  6 }
  7
  8 void bar(void)
  9 {
 10   while (b == 0) continue;
 11   smp_mb();
 12   assert(a == 1);
 13 }


Quick Quiz C.8: Say what??? Why do we need a memory barrier here, given that the CPU cannot possibly execute the assert() until after the while loop completes? End Quick Quiz

With this change, the sequence of operations might be as follows:

  1. CPU 0 executes a = 1. The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of ``a'' in its store buffer and transmits an ``invalidate'' message in order to flush the corresponding cache line from CPU 1's cache.
  2. CPU 1 executes while (b == 0) continue, but the cache line containing ``b'' is not in its cache. It therefore transmits a ``read'' message.
  3. CPU 1 receives CPU 0's ``invalidate'' message, queues it, and immediately responds to it.
  4. CPU 0 receives the response from CPU 1, and is therefore free to proceed past the smp_mb() on line 4 above, moving the value of ``a'' from its store buffer to its cache line.
  5. CPU 0 executes b = 1. It already owns this cache line (in other words, the cache line is already in either the ``modified'' or the ``exclusive'' state), so it stores the new value of ``b'' in its cache line.
  6. CPU 0 receives the ``read'' message, and transmits the cache line containing the now-updated value of ``b'' to CPU 1, also marking the line as ``shared'' in its own cache.
  7. CPU 1 receives the cache line containing ``b'' and installs it in its cache.
  8. CPU 1 can now finish executing while (b == 0) continue, and since it finds that the value of ``b'' is 1, it proceeds to the next statement, which is now a memory barrier.
  9. CPU 1 must now stall until it processes all pre-existing messages in its invalidation queue.
  10. CPU 1 now processes the queued ``invalidate'' message, and invalidates the cache line containing ``a'' from its own cache.
  11. CPU 1 executes the assert(a == 1), and, since the cache line containing ``a'' is no longer in CPU 1's cache, it transmits a ``read'' message.
  12. CPU 0 responds to this ``read'' message with the cache line containing the new value of ``a''.
  13. CPU 1 receives this cache line, which contains a value of 1 for ``a'', so that the assertion does not trigger.

With much passing of MESI messages, the CPUs arrive at the correct answer. This section illustrates why CPU designers must be extremely careful with their cache-coherence optimizations.

Paul E. McKenney 2011-12-16