F.5 Chapter [*]

Quick Quiz [*].1: 
Is there a better solution to the Dining Philosophers Problem?
 
Answer:

Figure: Dining Philosophers Problem, Fully Partitioned
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM}

One such improved solution is shown in Figure [*], where the philosophers are simply provided with an additional five forks. All five philosophers may now eat simultaneously, and there is never any need for philosophers to wait on one another. In addition, the improved disease control provided by this approach should not be underestimated.

This solution can seem like cheating to some, but such ``cheating'' is key to finding good solutions to many concurrency problems.

Quick Quiz [*].2: 
And in just what sense can this ``horizontal parallelism'' be said to be ``horizontal''?
 
Answer:
Inman was working with protocol stacks, which are normally depicted vertically, with the application on top and the hardware interconnect on the bottom. Data flows up and down this stack. ``Horizontal parallelism'' processes packets from different network connections in parallel, while ``vertical parallelism'' handles different protocol-processing steps for a given packet in parallel.

``Vertical parallelism'' is also called ``pipelining''.

Quick Quiz [*].3: 
In this compound double-ended queue implementation, what should be done if the queue has become non-empty while releasing and reacquiring the lock?
 
Answer:
In this case, simply dequeue an item from the now-nonempty queue, release both locks, and return.
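For concreteness, here is a rough sketch of that recheck, assuming a hypothetical compound deque built from two single-lock deques protected by pthread mutexes. The type and helper names (struct deq, deq_dequeue_r(), and so on) are illustrative stand-ins rather than the book's actual listings.

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

struct element;                              /* application-defined payload */
struct deq;                                  /* assumed: simple sequential deque */
struct element *deq_dequeue_r(struct deq *d); /* assumed helper */

/* Hypothetical compound deque: two single-lock deques. */
struct pdeq {
  pthread_mutex_t llock;  /* protects ldeq */
  pthread_mutex_t rlock;  /* protects rdeq */
  struct deq *ldeq;       /* left-hand underlying deque */
  struct deq *rdeq;       /* right-hand underlying deque */
};

/* Dequeue from the right-hand end, or return NULL if both deques are empty. */
struct element *pdeq_dequeue_r(struct pdeq *d)
{
  struct element *e;

  pthread_mutex_lock(&d->rlock);
  e = deq_dequeue_r(d->rdeq);       /* common case: right side non-empty */
  pthread_mutex_unlock(&d->rlock);
  if (e != NULL)
    return e;

  /* Right side looked empty: acquire both locks in hierarchy order. */
  pthread_mutex_lock(&d->llock);
  pthread_mutex_lock(&d->rlock);

  /* Some other thread may have enqueued while no locks were held,
   * so recheck the right-hand deque before raiding the left one. */
  e = deq_dequeue_r(d->rdeq);
  if (e == NULL)
    e = deq_dequeue_r(d->ldeq);

  pthread_mutex_unlock(&d->rlock);
  pthread_mutex_unlock(&d->llock);
  return e;
}
\end{verbatim}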

Quick Quiz [*].4: 
Is the hashed double-ended queue a good solution? Why or why not?
 
Answer:
The best way to answer this is to run lockhdeq.c on a number of different multiprocessor systems, and you are encouraged to do so in the strongest possible terms. One reason for concern is that each operation on this implementation must acquire not one but two locks.

The first well-designed performance study will be cited. Do not forget to compare to a sequential implementation!

Quick Quiz [*].5: 
Move all the elements to the queue that became empty? In what possible universe is this braindead solution in any way optimal???
 
Answer:
It is optimal in the case where data flow switches direction only rarely. It would of course be an extremely poor choice if the double-ended queue were being emptied from both ends concurrently. This of course raises the question as to what possible universe would make emptying from both ends concurrently a reasonable thing to do...

Quick Quiz [*].6: 
Why can't the compound parallel double-ended queue implementation be symmetric?
 
Answer:
The need to avoid deadlock by imposing a lock hierarchy forces the asymmetry, just as it does in the fork-numbering solution to the Dining Philosophers Problem.

Quick Quiz [*].7: 
Why is it necessary to retry the right-dequeue operation on line 29 of Figure [*]?
 
Answer:
This retry is necessary because some other thread might have enqueued an element between the time that this thread dropped the lock and the time that it reacquired the lock.

Quick Quiz [*].8: 
Surely the left-hand lock must sometimes be available!!! So why is it necessary that line 26 of Figure [*] unconditionally release the right-hand lock?
 
Answer:
It would be possible to use spin_trylock() to attempt to acquire the left-hand lock when it was available. However, the failure case would still need to drop the right-hand lock and then re-acquire the two locks in order. Making this transformation (and determining whether or not it is worthwhile) is left as an exercise for the reader.
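For illustration, here is roughly what that transformation might look like, building on the hypothetical pdeq sketch shown earlier and using pthread_mutex_trylock() as a stand-in for spin_trylock(); whether it is worthwhile remains, as noted above, an exercise.

\begin{verbatim}
/* Opportunistic locking of both ends, called with d->rlock held.
 * Try the left-hand lock first; only if that fails, fall back to the
 * unconditional release-and-reacquire in lock-hierarchy order. */
static void pdeq_lock_both_from_right(struct pdeq *d)
{
  if (pthread_mutex_trylock(&d->llock) == 0)
    return;                       /* fast path: left-hand lock acquired */

  /* Slow path: avoid deadlock by dropping the right-hand lock and
   * then acquiring both locks in order. */
  pthread_mutex_unlock(&d->rlock);
  pthread_mutex_lock(&d->llock);
  pthread_mutex_lock(&d->rlock);
}
\end{verbatim}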

Quick Quiz [*].9: 
The tandem double-ended queue runs about twice as fast as the hashed double-ended queue, even when I increase the size of the hash table to an insanely large number. Why is that?
 
Answer:
The hashed double-ended queue's locking design only permits one thread at a time at each end, and further requires two lock acquisitions for each operation. The tandem double-ended queue also permits one thread at a time at each end, and in the common case requires only one lock acquisition per operation. Therefore, the tandem double-ended queue should be expected to outperform the hashed double-ended queue.

Can you create a double-ended queue that allows multiple concurrent operations at each end? If so, how? If not, why not?

Quick Quiz [*].10: 
Is there a significantly better way of handling concurrency for double-ended queues?
 
Answer:
Transform the problem to be solved so that multiple double-ended queues can be used in parallel, allowing the simpler single-lock double-ended queue to be used, and perhaps also replace each double-ended queue with a pair of conventional single-ended queues. Without such ``horizontal scaling'', the speedup is limited to 2.0. In contrast, horizontal-scaling designs can enable very large speedups, and are especially attractive if there are multiple threads working either end of the queue, because in this multiple-thread case the deque simply cannot provide strong ordering guarantees. And if there are no guarantees, we may as well obtain the performance benefits that come with refusing to provide the guarantees, right?
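As one purely illustrative sketch of such horizontal scaling (not the book's code), the shared double-ended queue might be replaced by an array of independent single-lock FIFO queues, with each thread hashing to its own shard. All names here are hypothetical, and, as noted above, this design gives up any ordering guarantee across queues in exchange for scalability.

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

#define NQUEUES 16                /* illustrative; tune to the thread count */

struct element {
  struct element *next;
  /* application payload here */
};

/* An ordinary single-lock FIFO queue. */
struct queue {
  pthread_mutex_t lock;
  struct element *head;
  struct element **tailp;
};

static struct queue qarray[NQUEUES];  /* one independent queue per shard */

void queues_init(void)            /* call once before use */
{
  int i;

  for (i = 0; i < NQUEUES; i++) {
    pthread_mutex_init(&qarray[i].lock, NULL);
    qarray[i].head = NULL;
    qarray[i].tailp = &qarray[i].head;
  }
}

/* Threads contend only when they hash to the same shard, so adding
 * threads (and shards) scales far better than a single shared deque. */
struct queue *queue_of(unsigned int thread_id)
{
  return &qarray[thread_id % NQUEUES];
}

void enqueue(struct queue *q, struct element *e)
{
  e->next = NULL;
  pthread_mutex_lock(&q->lock);
  *q->tailp = e;
  q->tailp = &e->next;
  pthread_mutex_unlock(&q->lock);
}

struct element *dequeue(struct queue *q)
{
  struct element *e;

  pthread_mutex_lock(&q->lock);
  e = q->head;
  if (e != NULL) {
    q->head = e->next;
    if (q->head == NULL)
      q->tailp = &q->head;        /* queue now empty */
  }
  pthread_mutex_unlock(&q->lock);
  return e;
}
\end{verbatim}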

Quick Quiz [*].11: 
What are some ways of preventing a structure from being freed while its lock is being acquired?
 
Answer:
Here are a few possible solutions to this existence guarantee problem:

  1. Provide a statically allocated lock that is held while the per-structure lock is being acquired, which is an example of hierarchical locking (see Section [*]). Of course, using a single global lock for this purpose can result in unacceptably high levels of lock contention, dramatically reducing performance and scalability.
  2. Provide an array of statically allocated locks, hashing the structure's address to select the lock to be acquired, as described in Chapter [*] (a code sketch of this approach appears at the end of this answer). Given a hash function of sufficiently high quality, this avoids the scalability limitations of the single global lock, but in read-mostly situations, the lock-acquisition overhead can result in unacceptably degraded performance.
  3. Use a garbage collector, in software environments providing them, so that a structure cannot be deallocated while being referenced. This works very well, removing the existence-guarantee burden (and much else besides) from the developer's shoulders, but imposes the overhead of garbage collection on the program. Although garbage-collection technology has advanced considerably in the past few decades, its overhead may be unacceptably high for some applications. In addition, some applications require that the developer exercise more control over the layout and placement of data structures than is permitted by most garbage collected environments.
  4. As a special case of a garbage collector, use a global reference counter, or a global array of reference counters.
  5. Use hazard pointers [Mic04], which can be thought of as an inside-out reference count. Hazard-pointer-based algorithms maintain a per-thread list of pointers, so that the appearance of a given pointer on any of these lists acts as a reference to the corresponding structure. Hazard pointers are an interesting research direction, but have not yet seen much use in production (written in 2008).
  6. Use transactional memory (TM) [HM93,Lom77,ST95], so that each reference and modification to the data structure in question is performed atomically. Although TM has engendered much excitement in recent years, and seems likely to be of some use in production software, developers should exercise some caution [BLM05,BLM06,MMW07], particularly in performance-critical code. In particular, existence guarantees require that the transaction cover the full path from a global reference to the data elements being updated.
  7. Use RCU, which can be thought of as an extremely lightweight approximation to a garbage collector. Updaters are not permitted to free RCU-protected data structures that RCU readers might still be referencing. RCU is most heavily used for read-mostly data structures, and is discussed at length in Chapter [*].

For more on providing existence guarantees, see Chapters [*] and [*].
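To make the second option above concrete, here is a minimal sketch of a lock-based existence guarantee using a statically allocated array of locks. For simplicity, the lock is selected here by hashing the key that locates the structure rather than the structure's address, which serves the same purpose as long as the structure is published and unpublished under that same lock. All names are hypothetical.

\begin{verbatim}
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 1024             /* illustrative sizes */
#define NLOCKS    256

struct element {
  unsigned long key;
  pthread_mutex_t lock;           /* per-structure lock */
  /* ... */
};

/* Statically allocated locks never go away, so one of them can always be
 * safely acquired; each protects a slice of the table below, so an element
 * cannot be freed while its slice's lock is held. */
static pthread_mutex_t locktable[NLOCKS] = {
  [0 ... NLOCKS - 1] = PTHREAD_MUTEX_INITIALIZER  /* GCC/Clang extension */
};
static struct element *hashtable[NBUCKETS];

static pthread_mutex_t *lock_of(int b)
{
  return &locktable[b % NLOCKS];
}

/* Acquire the element's own lock only while the hashed lock pins it. */
int element_op(unsigned long key)
{
  int b = key % NBUCKETS;         /* trivial hash, for illustration */
  pthread_mutex_t *sp = lock_of(b);
  struct element *p;

  pthread_mutex_lock(sp);
  p = hashtable[b];
  if (p == NULL || p->key != key) {
    pthread_mutex_unlock(sp);
    return 0;
  }
  pthread_mutex_lock(&p->lock);   /* safe: p cannot be freed yet */
  pthread_mutex_unlock(sp);
  /* ... operate on *p under p->lock ... */
  pthread_mutex_unlock(&p->lock);
  return 1;
}

/* Deletion must unpublish the element under the same hashed lock. */
int element_del(unsigned long key)
{
  int b = key % NBUCKETS;
  pthread_mutex_t *sp = lock_of(b);
  struct element *p;

  pthread_mutex_lock(sp);
  p = hashtable[b];
  if (p == NULL || p->key != key) {
    pthread_mutex_unlock(sp);
    return 0;
  }
  pthread_mutex_lock(&p->lock);
  hashtable[b] = NULL;            /* no new thread can now find p */
  pthread_mutex_unlock(sp);
  pthread_mutex_unlock(&p->lock);
  free(p);
  return 1;
}
\end{verbatim}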

Quick Quiz [*].12: 
How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn't all of the traces in Figure [*] have efficiency of exactly 1.0 when running on only one thread?
 
Answer:
The matmul.c program creates the specified number of worker threads, so even the single-worker-thread case incurs thread-creation overhead. Making the changes required to optimize away thread-creation overhead in the single-worker-thread case is left as an exercise to the reader.

Quick Quiz [*].13: 
How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!!
 
Answer:
I am glad that you are paying attention! This example serves to show that although data parallelism can be a very good thing, it is not some magic wand that automatically wards off any and all sources of inefficiency. Linear scaling at full performance, even to ``only'' 64 threads, requires care at all phases of design and implementation.

In particular, you need to pay careful attention to the size of the partitions. For example, if you split a 64-by-64 matrix multiply across 64 threads, each thread gets only 64 floating-point multiplies. The cost of a floating-point multiply is minuscule compared to the overhead of thread creation.

Moral: If you have a parallel program with variable input, always include a check for the input size being too small to be worth parallelizing. And when it is not helpful to parallelize, it is not helpful to spawn a single thread, now is it?
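A hypothetical guard applying this moral might look as follows; the threshold value and the matmul_seq()/matmul_par() helpers are illustrative assumptions, and the threshold should of course be measured on the target system rather than guessed.

\begin{verbatim}
#define PARALLEL_THRESHOLD 256    /* illustrative, not measured */

void matmul_seq(double *a, double *b, double *c, int n);               /* assumed */
void matmul_par(double *a, double *b, double *c, int n, int nthreads); /* assumed */

/* Fall back to the sequential version when the problem is too small
 * for thread-creation overhead to be repaid. */
void matmul(double *a, double *b, double *c, int n, int nthreads)
{
  if (n < PARALLEL_THRESHOLD || nthreads < 2)
    matmul_seq(a, b, c, n);       /* no threads spawned at all */
  else
    matmul_par(a, b, c, n, nthreads);
}
\end{verbatim}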

Quick Quiz [*].14: 
In what situation would hierarchical locking work well?
 
Answer:
If the comparison on line 31 of Figure [*] were replaced by a much heavier-weight operation, then releasing bp->bucket_lock might reduce lock contention enough to outweigh the overhead of the extra acquisition and release of cur->node_lock.
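The following sketch shows the general shape of that hierarchical-locking variant. The node and bucket types and the heavyweight_match() helper are hypothetical stand-ins for the book's hash-table code, and the sketch assumes that removal acquires both the bucket lock and the node lock, and that at most one node per bucket carries a given hash value.

\begin{verbatim}
#include <pthread.h>
#include <stddef.h>

struct node {
  pthread_mutex_t node_lock;
  struct node *next;
  unsigned long hash;             /* cheap filter checked under bucket_lock */
  /* full key material, etc. */
};

struct bucket {
  pthread_mutex_t bucket_lock;
  struct node *head;
};

int heavyweight_match(struct node *np, void *key);  /* assumed: expensive */

/* Hierarchical locking: hold bucket_lock only long enough to find the
 * candidate node and pin it with node_lock, then run the expensive
 * comparison without blocking other users of the bucket. */
int hash_search(struct bucket *bp, unsigned long hash, void *key)
{
  struct node *cur;
  int retval;

  pthread_mutex_lock(&bp->bucket_lock);
  for (cur = bp->head; cur != NULL; cur = cur->next) {
    if (cur->hash != hash)
      continue;                   /* cheap reject under bucket_lock */
    pthread_mutex_lock(&cur->node_lock);
    pthread_mutex_unlock(&bp->bucket_lock);
    retval = heavyweight_match(cur, key);  /* expensive part */
    pthread_mutex_unlock(&cur->node_lock);
    return retval;
  }
  pthread_mutex_unlock(&bp->bucket_lock);
  return 0;
}
\end{verbatim}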

Quick Quiz [*].15: 
In Figure [*], there is a pattern of performance rising with increasing run length in groups of three samples, for example, for run lengths 10, 11, and 12. Why?
 
Answer:
This is due to the per-CPU target value being three. A run length of 12 must acquire the global-pool lock twice, while a run length of 13 must acquire the global-pool lock three times.

Quick Quiz [*].16: 
Allocation failures were observed in the two-thread tests at run lengths of 19 and greater. Given the global-pool size of 40 and the per-CPU target pool size of three, what is the smallest allocation run length at which failures can occur?
 
Answer:
The exact solution to this problem is left as an exercise to the reader. The first solution received will be credited to its submitter. As a rough rule of thumb, the global pool size should be at least $m+2sn$, where ``m'' is the maximum number of elements allocated at a given time, ``s'' is the per-CPU pool size, and ``n'' is the number of CPUs.
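As a purely illustrative application of this rule of thumb (not the exact answer, which remains an exercise for the reader), plugging in the quiz's values $s=3$ and $n=2$ gives:

\[
m + 2sn = m + 2 \cdot 3 \cdot 2 = m + 12 \le 40 \quad\Longrightarrow\quad m \le 28 ,
\]

so the 40-element global pool would be expected to suffice only while the two threads hold at most 28 elements between them. At a run length of 19, the two threads can hold as many as 38 elements at once, so the observed failures are at least consistent with the rule of thumb.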
