Quick Quiz .2:
When an ``invalidate'' message appears in a large multiprocessor,
every CPU must give an ``invalidate acknowledge'' response.
Wouldn't the resulting ``storm'' of ``invalidate acknowledge''
responses totally saturate the system bus?
Answer:
It might, if large-scale multiprocessors were in fact implemented
that way. Larger multiprocessors, particularly NUMA machines,
tend to use so-called ``directory-based'' cache-coherence
protocols to avoid this and other problems.
Quick Quiz .3:
If SMP machines are really using message passing
anyway, why bother with SMP at all?
Answer:
There has been quite a bit of controversy on this topic over
the past few decades. One answer is that the cache-coherence
protocols are quite simple, and therefore can be implemented
directly in hardware, gaining bandwidths and latencies
unattainable by software message passing. Another answer is that
the real truth is to be found in economics, namely, in the relative
prices of large SMP machines and of clusters of smaller
SMP machines. A third answer is that the SMP programming
model is easier to use than that of distributed systems, but
a rebuttal might note the appearance of HPC clusters and MPI.
And so the argument continues.
Quick Quiz .4:
How does the hardware handle the delayed transitions
described above?
Answer:
Usually by adding additional states, though these additional
states need not be actually stored with the cache line, because
only a few lines at a time will be transitioning.
The need to delay transitions is but one issue that results in
real-world cache coherence protocols being much more complex than
the over-simplified MESI protocol described in this appendix.
Hennessy and Patterson's classic introduction to computer
architecture [HP95] covers many of these issues.
Quick Quiz .5:
What sequence of operations would put the CPUs' caches
all back into the ``invalid'' state?
Answer:
There is no such sequence, at least in the absence of special
``flush my cache'' instructions in the CPU's instruction set.
Most CPUs do have such instructions.
Quick Quiz .6:
In step 1 above, why does CPU 0 need to issue a ``read invalidate''
rather than a simple ``invalidate''?
Answer:
Because the cache line in question contains more than just the
variable a.
Quick Quiz .7:
In step 1 of the first scenario in
Section ,
why is an ``invalidate'' sent instead of a ``read invalidate''
message?
Doesn't CPU 0 need the values of the other variables that share
this cache line with ``a''?
Answer:
CPU 0 already has the values of these variables, given that it
has a read-only copy of the cache line containing ``a''.
Therefore, all CPU 0 need do is to cause the other CPUs to discard
their copies of this cache line.
An ``invalidate'' message therefore suffices.
Quick Quiz .8:
Say what???
Why do we need a memory barrier here, given that the CPU cannot
possibly execute the assert() until after the
while loop completes?
Answer:
CPUs are free to speculatively execute, which can have the effect
of executing the assertion before the while loop completes.
Quick Quiz .9:
Does the guarantee that each CPU sees its own memory accesses
in order also guarantee that each user-level thread will see
its own memory accesses in order?
Why or why not?
Answer:
No. Consider the case where a thread migrates from one CPU to
another, and where the destination CPU perceives the source
CPU's recent memory operations out of order. To preserve
user-mode sanity, kernel hackers must use memory barriers in
the context-switch path. However, the locking already required
to safely do a context switch should automatically provide
the memory barriers needed to cause the user-level task to see
its own accesses in order. That said, if you are designing a
super-optimized scheduler, either in the kernel or at user level,
please keep this scenario in mind!
Quick Quiz .10:
Could this code be fixed by inserting a memory barrier
between CPU 1's ``while'' and assignment to ``c''?
Why or why not?
Answer:
No. Such a memory barrier would only force ordering local to CPU 1.
It would have no effect on the relative ordering of CPU 0's and
CPU 1's accesses, so the assertion could still fail.
However, all mainstream computer systems provide one mechanism
or another to provide ``transitivity'', which provides
intuitive causal ordering: if B saw the effects of A's accesses,
and C saw the effects of B's accesses, then C must also see
the effects of A's accesses.
In short, hardware designers have taken at least a little pity
on software developers.
Quick Quiz .11:
Suppose that lines 3-5 for CPUs 1 and 2 in
Table
are in an interrupt
handler, and that CPU 2's line 9 is run at process level.
What changes, if any, are required to enable the code to work
correctly, in other words, to prevent the assertion from firing?
Answer:
The assertion will need to be written to ensure that the load of
``e'' precedes that of ``a''.
In the Linux kernel, the barrier() primitive may be used to accomplish
this in much the same way that the memory barrier was used in the
assertions in the previous examples.
Quick Quiz .12:
If CPU 2 executed an assert(e==0||c==1) in the example in
Table ,
would this assert ever trigger?
Answer:
The result depends on whether the CPU supports ``transitivity.''
In other words, CPU 0 stored to ``e'' after seeing CPU 1's
store to ``c'', with a memory barrier between CPU 0's load
from ``c'' and store to ``e''.
If some other CPU sees CPU 0's store to ``e'', is it also
guaranteed to see CPU 1's store?
All CPUs I am aware of claim to provide transitivity.
Quick Quiz .13:
Why is Alpha's smp_read_barrier_depends() an
smp_mb() rather than smp_rmb()?
Answer:
First, Alpha has only mb and wmb instructions,
so smp_rmb() would be implemented by the Alpha mb
instruction in either case.
More importantly, smp_read_barrier_depends() must order subsequent
stores. For example, consider the following code:

  p = global_pointer;
  smp_read_barrier_depends();
  if (do_something_with(p->a, p->b) == 0)
    p->hey_look = 1;
Here the store to p->hey_look must be ordered, not just the loads from p->a and p->b.
Paul E. McKenney 2011-12-16