C.4.3 Invalidate Queues and Memory Barriers
Let us suppose that CPUs queue invalidation requests, but respond to
them immediately, before actually invalidating the corresponding
cache lines.
This approach minimizes the cache-invalidation latency seen by CPUs
doing stores, but can defeat memory barriers, as seen in the following
example.
Suppose the values of ``a'' and ``b'' are initially zero,
that ``a'' is replicated read-only (MESI ``shared'' state),
and that ``b''
is owned by CPU 0 (MESI ``exclusive'' or ``modified'' state).
Then suppose that CPU 0 executes foo() while CPU 1 executes
function bar() in the following code fragment:
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 assert(a == 1);
12 }
Then the sequence of operations might be as follows:
- CPU 0 executes a = 1. The corresponding
cache line is read-only in
CPU 0's cache, so CPU 0 places the new value of ``a'' in its
store buffer and transmits an ``invalidate'' message in order
to flush the corresponding cache line from CPU 1's cache.
- CPU 1 executes while (b == 0) continue, but the cache line
containing ``b'' is not in its cache.
It therefore transmits a ``read'' message.
- CPU 1 receives CPU 0's ``invalidate'' message, queues it, and
immediately responds to it.
- CPU 0 receives the response from CPU 1, and is therefore free
to proceed past the smp_mb() on line 4 above, moving
the value of ``a'' from its store buffer to its cache line.
- CPU 0 executes b = 1.
It already owns this cache line (in other words, the cache line
is already in either the ``modified'' or the ``exclusive'' state),
so it stores the new value of ``b'' in its cache line.
- CPU 0 receives the ``read'' message, and transmits the
cache line containing the now-updated value of ``b''
to CPU 1, also marking the line as ``shared'' in its own cache.
- CPU 1 receives the cache line containing ``b'' and installs
it in its cache.
- CPU 1 can now finish executing while (b == 0) continue,
and since it finds that the value of ``b'' is 1, it proceeds
to the next statement.
- CPU 1 executes the assert(a == 1), and, since the
old value of ``a'' is still in CPU 1's cache,
this assertion fails.
- Despite the assertion failure, CPU 1 processes the queued
``invalidate'' message, and (tardily)
invalidates the cache line containing ``a'' from its own cache.
Quick Quiz C.7:
In step 1 of the first scenario above,
why is an ``invalidate'' message sent instead of a ``read invalidate''
message?
Doesn't CPU 0 need the values of the other variables that share
this cache line with ``a''?
End Quick Quiz
There is clearly not much point in accelerating invalidation responses
if doing so causes memory barriers to effectively be ignored.
However, the memory-barrier instructions can interact with
the invalidate queue, so that when a given CPU executes a memory
barrier, it marks all the entries currently in its invalidate queue,
and forces any subsequent load to wait until all marked entries
have been applied to the CPU's cache.
Therefore, we can add a memory barrier to function bar() as follows:
1 void foo(void)
2 {
3 a = 1;
4 smp_mb();
5 b = 1;
6 }
7
8 void bar(void)
9 {
10 while (b == 0) continue;
11 smp_mb();
12 assert(a == 1);
13 }
Quick Quiz C.8:
Say what???
Why do we need a memory barrier here, given that the CPU cannot
possibly execute the assert() until after the
while loop completes?
End Quick Quiz
With this change, the sequence of operations might be as follows:
- CPU 0 executes a = 1. The corresponding
cache line is read-only in
CPU 0's cache, so CPU 0 places the new value of ``a'' in its
store buffer and transmits an ``invalidate'' message in order
to flush the corresponding cache line from CPU 1's cache.
- CPU 1 executes while (b == 0) continue, but the cache line
containing ``b'' is not in its cache.
It therefore transmits a ``read'' message.
- CPU 1 receives CPU 0's ``invalidate'' message, queues it, and
immediately responds to it.
- CPU 0 receives the response from CPU 1, and is therefore free
to proceed past the smp_mb() on line 4 above, moving
the value of ``a'' from its store buffer to its cache line.
- CPU 0 executes b = 1.
It already owns this cache line (in other words, the cache line
is already in either the ``modified'' or the ``exclusive'' state),
so it stores the new value of ``b'' in its cache line.
- CPU 0 receives the ``read'' message, and transmits the
cache line containing the now-updated value of ``b''
to CPU 1, also marking the line as ``shared'' in its own cache.
- CPU 1 receives the cache line containing ``b'' and installs
it in its cache.
- CPU 1 can now finish executing while (b == 0) continue,
and since it finds that the value of ``b'' is 1, it proceeds
to the next statement, which is now a memory barrier.
- CPU 1 must now stall until it processes all pre-existing
messages in its invalidation queue.
- CPU 1 now processes the queued
``invalidate'' message, and
invalidates the cache line containing ``a'' from its own cache.
- CPU 1 executes the assert(a == 1), and, since the
cache line containing ``a'' is no longer in CPU 1's cache,
it transmits a ``read'' message.
- CPU 0 responds to this ``read'' message with the cache line
containing the new value of ``a''.
- CPU 1 receives this cache line, which contains a value of 1 for
``a'', so that the assertion does not trigger.
With much passing of MESI messages, the CPUs arrive at the correct answer.
This section illustrates why CPU designers must be extremely careful
with their cache-coherence optimizations.
Paul E. McKenney
2011-12-16