F.14 Chapter [*]

Quick Quiz [*].1: 
What happens if two CPUs attempt to invalidate the same cache line concurrently?
 
Answer:
One of the CPUs gains access to the shared bus first, and that CPU ``wins''. The other CPU must invalidate its copy of the cache line and transmit an ``invalidate acknowledge'' message to the winning CPU.
Of course, the losing CPU can be expected to immediately issue a ``read invalidate'' transaction, so the winning CPU's victory will be quite ephemeral.

Quick Quiz [*].2: 
When an ``invalidate'' message appears in a large multiprocessor, every CPU must give an ``invalidate acknowledge'' response. Wouldn't the resulting ``storm'' of ``invalidate acknowledge'' responses totally saturate the system bus?
 
Answer:
It might, if large-scale multiprocessors were in fact implemented that way. Larger multiprocessors, particularly NUMA machines, tend to use so-called ``directory-based'' cache-coherence protocols to avoid this and other problems.

Quick Quiz [*].3: 
If SMP machines are really using message passing anyway, why bother with SMP at all?
 
Answer:
There has been quite a bit of controversy on this topic over the past few decades. One answer is that cache-coherence protocols are quite simple, and therefore can be implemented directly in hardware, attaining bandwidths and latencies unattainable by software message passing. Another answer is that the real truth is to be found in economics, namely in the relative prices of large SMP machines and of clusters of smaller SMP machines. A third answer is that the SMP programming model is easier to use than that of distributed systems, but a rebuttal might note the appearance of HPC clusters and MPI. And so the argument continues.

Quick Quiz [*].4: 
How does the hardware handle the delayed transitions described above?
 
Answer:
Usually by adding additional states, though these additional states need not be stored with the cache line itself, because only a few lines at a time will be transitioning. The need to delay transitions is but one issue that results in real-world cache coherence protocols being much more complex than the over-simplified MESI protocol described in this appendix. Hennessy and Patterson's classic introduction to computer architecture [HP95] covers many of these issues.

Quick Quiz [*].5: 
What sequence of operations would put the CPUs' caches all back into the ``invalid'' state?
 
Answer:
There is no such sequence, at least in the absence of special ``flush my cache'' instructions in the CPU's instruction set. Most CPUs do have such instructions.

Quick Quiz [*].6: 
In step 1 above, why does CPU 0 need to issue a ``read invalidate'' rather than a simple ``invalidate''?
 
Answer:
Because the cache line in question contains more than just the variable ``a''.

Quick Quiz [*].7: 
In step 1 of the first scenario in Section [*], why is an ``invalidate'' sent instead of a ``read invalidate'' message? Doesn't CPU 0 need the values of the other variables that share this cache line with ``a''?
 
Answer:
CPU 0 already has the values of these variables, given that it has a read-only copy of the cache line containing ``a''. Therefore, all CPU 0 need do is cause the other CPUs to discard their copies of this cache line. An ``invalidate'' message therefore suffices.

Quick Quiz [*].8: 
Say what??? Why do we need a memory barrier here, given that the CPU cannot possibly execute the assert() until after the while loop completes?
 
Answer:
CPUs are free to speculatively execute, which can have the effect of executing the assertion before the while loop completes.

Quick Quiz [*].9: 
Does the guarantee that each CPU sees its own memory accesses in order also guarantee that each user-level thread will see its own memory accesses in order? Why or why not?
 
Answer:
No. Consider the case where a thread migrates from one CPU to another, and where the destination CPU perceives the source CPU's recent memory operations out of order. To preserve user-mode sanity, kernel hackers must use memory barriers in the context-switch path. However, the locking already required to safely do a context switch should automatically provide the memory barriers needed to cause the user-level task to see its own accesses in order. That said, if you are designing a super-optimized scheduler, either in the kernel or at user level, please keep this scenario in mind!

Quick Quiz [*].10: 
Could this code be fixed by inserting a memory barrier between CPU 1's ``while'' and assignment to ``c''? Why or why not?
 
Answer:
No. Such a memory barrier would only force ordering local to CPU 1. It would have no effect on the relative ordering of CPU 0's and CPU 1's accesses, so the assertion could still fail. However, all mainstream computer systems provide one mechanism or another to provide ``transitivity'', which provides intuitive causal ordering: if B saw the effects of A's accesses, and C saw the effects of B's accesses, then C must also see the effects of A's accesses. In short, hardware designers have taken at least a little pity on software developers.

Quick Quiz [*].11: 
Suppose that lines 3-5 for CPUs 1 and 2 in Table [*] are in an interrupt handler, and that CPU 2's line 9 is run at process level. What changes, if any, are required to enable the code to work correctly, in other words, to prevent the assertion from firing?
 
Answer:
The assertion will need to be written to ensure that the load of ``e'' precedes that of ``a''. In the Linux kernel, the barrier() primitive may be used to accomplish this in much the same way that the memory barrier was used in the assertions in the previous examples.

Quick Quiz [*].12: 
If CPU 2 executed an assert(e==0||c==1) in the example in Table [*], would this assert ever trigger?
 
Answer:
The result depends on whether the CPU supports ``transitivity.'' In other words, CPU 0 stored to ``e'' after seeing CPU 1's store to ``c'', with a memory barrier between CPU 0's load from ``c'' and store to ``e''. If some other CPU sees CPU 0's store to ``e'', is it also guaranteed to see CPU 1's store?

All CPUs I am aware of claim to provide transitivity.

Quick Quiz [*].13: 
Why is Alpha's smp_read_barrier_depends() an smp_mb() rather than smp_rmb()?
 
Answer:
First, Alpha has only mb and wmb instructions, so smp_rmb() would be implemented by the Alpha mb instruction in either case.

More importantly, smp_read_barrier_depends() must order subsequent stores. For example, consider the following code:



  p = global_pointer;
  smp_read_barrier_depends();
  if (do_something_with(p->a, p->b) == 0)
    p->hey_look = 1;


Here the store to p->hey_look must be ordered, not just the loads from p->a and p->b.

Paul E. McKenney 2011-12-16