One such improved solution is shown in
Figure ,
where the philosophers are simply provided with an additional
five forks.
All five philosophers may now eat simultaneously, and there
is never any need for philosophers to wait on one another.
In addition, the improved disease control provided by this
approach should not be underestimated.
This solution can seem like cheating to some, but such ``cheating'' is key to finding good solutions to many concurrency problems.
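A minimal pthreads sketch of this fork-rich approach (my illustration, not code from the book) simply gives each philosopher a private pair of forks, so that no synchronization whatsoever is required:

  #include <pthread.h>
  #include <stdio.h>

  #define NR_PHIL 5

  /* Each philosopher owns forks 2*id and 2*id+1 outright, so all
   * five may eat concurrently without ever blocking. */
  void *philosopher(void *arg)
  {
    long id = (long)arg;

    /* think, then eat with one's own two forks, then repeat */
    printf("philosopher %ld eats with forks %ld and %ld\n",
           id, 2 * id, 2 * id + 1);
    return NULL;
  }

  int main(void)
  {
    pthread_t tid[NR_PHIL];
    long i;

    for (i = 0; i < NR_PHIL; i++)
      pthread_create(&tid[i], NULL, philosopher, (void *)i);
    for (i = 0; i < NR_PHIL; i++)
      pthread_join(tid[i], NULL);
    return 0;
  }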
Quick Quiz .2:
And in just what sense can this ``horizontal parallelism'' be
said to be ``horizontal''?
Answer:
Inman was working with protocol stacks, which are normally
depicted vertically, with the application on top and the
hardware interconnect on the bottom.
Data flows up and down this stack.
``Horizontal parallelism'' processes packets from different network
connections in parallel, while ``vertical parallelism''
handles different protocol-processing steps for a given
packet in parallel.
``Vertical parallelism'' is also called ``pipelining''.
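For concreteness, the following sketch shows the horizontal form, with the vertical form described in comments; the stage and structure names are my own illustrative assumptions, not code from any actual protocol stack:

  #include <stdio.h>

  struct packet {
    int conn;  /* connection number */
    int seq;   /* sequence number within the connection */
  };

  /* Toy protocol-processing stages; a real stack would do IP,
   * TCP, and socket-layer work here. */
  static void stage1(struct packet *p) { (void)p; }
  static void stage2(struct packet *p) { (void)p; }
  static void stage3(struct packet *p)
  {
    printf("conn %d seq %d\n", p->conn, p->seq);
  }

  /* Horizontal parallelism: run one instance of this function in
   * a separate thread per connection, so that whole packets from
   * different connections are processed concurrently. */
  void process_packet(struct packet *p)
  {
    stage1(p);
    stage2(p);
    stage3(p);
  }

  /* Vertical parallelism ("pipelining") would instead run stage1(),
   * stage2(), and stage3() in three threads joined by queues, so
   * that different steps of successive packets overlap in time. */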
Quick Quiz .3:
In this compound double-ended queue implementation, what should
be done if the queue has become non-empty while releasing
and reacquiring the lock?
Answer:
In this case, simply dequeue an item from the now-nonempty
queue, release both locks, and return.
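A sketch of this logic might look as follows; the structure layout and helper functions are illustrative assumptions, not the book's lockdeq.c. Note that the recheck after reacquisition is exactly the retry revisited in Quick Quiz .7 below:

  #include <pthread.h>
  #include <stddef.h>

  struct pdeq {
    pthread_mutex_t llock;  /* left-hand lock */
    pthread_mutex_t rlock;  /* right-hand lock */
    /* left- and right-hand sub-queues omitted */
  };

  void *right_subqueue_pop(struct pdeq *d);  /* assumed helpers */
  void *move_left_then_pop(struct pdeq *d);

  void *pdeq_pop_r(struct pdeq *d)
  {
    void *e;

    pthread_mutex_lock(&d->rlock);
    e = right_subqueue_pop(d);
    if (e == NULL) {
      /* Right-hand queue empty: drop rlock, then acquire
       * both locks in hierarchy order to avoid deadlock. */
      pthread_mutex_unlock(&d->rlock);
      pthread_mutex_lock(&d->llock);
      pthread_mutex_lock(&d->rlock);
      /* Another thread may have enqueued while no locks were
       * held, so retry the right-hand pop before raiding the
       * left-hand queue. */
      e = right_subqueue_pop(d);
      if (e == NULL)
        e = move_left_then_pop(d);
      pthread_mutex_unlock(&d->llock);
    }
    pthread_mutex_unlock(&d->rlock);
    return e;
  }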
Quick Quiz .4:
Is the hashed double-ended queue a good solution?
Why or why not?
Answer:
The best way to answer this is to run lockhdeq.c on
a number of different multiprocessor systems, and you are
encouraged to do so in the strongest possible terms.
One reason for concern is that each operation on this
implementation must acquire not one but two locks.
The first well-designed performance study will be cited. Do not forget to compare to a sequential implementation!
Quick Quiz .5:
Move all the elements to the queue that became empty?
In what possible universe is this braindead solution in any
way optimal???
Answer:
It is optimal in the case where data flow switches direction only
rarely.
It would of course be an extremely poor choice if the double-ended
queue was being emptied from both ends concurrently.
This of course raises the question as to what possible universe
it is in which emptying from both ends concurrently would be a
reasonable thing to do...
Quick Quiz .6:
Why can't the compound parallel double-ended queue
implementation be symmetric?
Answer:
The need to avoid deadlock by imposing a lock hierarchy
forces the asymmetry, just as it does in the fork-numbering
solution to the Dining Philosophers Problem.
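For concreteness, here is a minimal sketch of the usual way such a hierarchy is imposed: any thread needing both locks acquires them in a fixed order, in this illustration by address.

  #include <stdint.h>
  #include <pthread.h>

  /* Deadlock avoidance via lock hierarchy: whichever thread needs
   * both locks always acquires the lower-addressed one first. */
  static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
  {
    if ((uintptr_t)a < (uintptr_t)b) {
      pthread_mutex_lock(a);
      pthread_mutex_lock(b);
    } else {
      pthread_mutex_lock(b);
      pthread_mutex_lock(a);
    }
  }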
Quick Quiz .7:
Why is it necessary to retry the right-dequeue operation
on line 29 of
Figure ?
Answer:
This retry is necessary because some other thread might have
enqueued an element between the time that this thread dropped
the lock and the time that it reacquired the lock.
Quick Quiz .8:
Surely the left-hand lock must sometimes be available!!!
So why is it necessary that line 26 of
Figure
unconditionally release the right-hand lock?
Answer:
It would be possible to use spin_trylock() to attempt
to acquire the left-hand lock when it was available.
However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order.
Making this transformation (and determining whether or not
it is worthwhile) is left as an exercise for the reader.
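For concreteness, the transformation might look something like the following untested fragment, which builds on the pdeq sketch above and uses pthread_mutex_trylock() in place of spin_trylock():

  if (pthread_mutex_trylock(&d->llock) != 0) {
    /* Left-hand lock unavailable: fall back to the failure
     * case, dropping the right-hand lock and reacquiring
     * both locks in hierarchy order. */
    pthread_mutex_unlock(&d->rlock);
    pthread_mutex_lock(&d->llock);
    pthread_mutex_lock(&d->rlock);
  }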
Quick Quiz .9:
The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the
size of the hash table to an insanely large number.
Why is that?
Answer:
The hashed double-ended queue's locking design only permits
one thread at a time at each end, and further requires
two lock acquisitions for each operation.
The tandem double-ended queue also permits one thread at a time
at each end, and in the common case requires only one lock
acquisition per operation.
Therefore, the tandem double-ended queue should be expected to
outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple concurrent operations at each end? If so, how? If not, why not?
Quick Quiz .10:
Is there a significantly better way of handling concurrency
for double-ended queues?
Answer:
Transform the problem to be solved so that multiple double-ended
queues can be used in parallel, allowing the simpler single-lock
double-ended queue to be used, and perhaps also replace each
double-ended queue with a pair of conventional single-ended queues.
Without such ``horizontal scaling'', the speedup is limited
to 2.0.
In contrast, horizontal-scaling designs can enable very large
speedups, and are especially attractive if there are multiple threads
working either end of the queue, because in this
multiple-thread case the deque
simply cannot provide strong ordering guarantees.
And if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide the
guarantees, right?
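A minimal sketch of this transformation follows; the structure names, the number of queues, and the round-robin placement policy are all illustrative assumptions:

  #include <pthread.h>
  #include <stddef.h>

  #define NR_QUEUES 4

  struct elem {
    struct elem *next;
  };

  struct queue {
    pthread_mutex_t lock;
    struct elem *head;
    struct elem **tail;
  };

  struct queue qs[NR_QUEUES];

  void q_init(struct queue *q)
  {
    pthread_mutex_init(&q->lock, NULL);
    q->head = NULL;
    q->tail = &q->head;
  }

  /* Conventional single-lock enqueue onto one queue. */
  void q_push(struct queue *q, struct elem *e)
  {
    e->next = NULL;
    pthread_mutex_lock(&q->lock);
    *q->tail = e;
    q->tail = &e->next;
    pthread_mutex_unlock(&q->lock);
  }

  /* Spread producers across the queues.  No cross-queue ordering
   * is promised, matching the observation above that none can be
   * given anyway. */
  void q_push_any(struct elem *e, unsigned long seq)
  {
    q_push(&qs[seq % NR_QUEUES], e);
  }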
Quick Quiz .11:
What are some ways of preventing a structure from being freed while
its lock is being acquired?
Answer:
There are a number of possible solutions to this existence
guarantee problem.
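As one illustrative example (a sketch of a common technique, and an assumption on my part rather than a reconstruction of the original list), a statically allocated lock, which itself can never be freed, can guard both lock acquisition and freeing:

  #include <pthread.h>
  #include <stdlib.h>

  pthread_mutex_t existence_lock = PTHREAD_MUTEX_INITIALIZER;

  struct node {
    pthread_mutex_t lock;
    /* data fields omitted */
  };

  /* np->lock is acquired only while existence_lock is held, so a
   * concurrent delete_node() cannot free np out from under us.
   * In a complete implementation, the lookup yielding np would
   * also run under existence_lock. */
  void acquire_node(struct node *np)
  {
    pthread_mutex_lock(&existence_lock);
    pthread_mutex_lock(&np->lock);
    pthread_mutex_unlock(&existence_lock);
  }

  void delete_node(struct node *np)
  {
    pthread_mutex_lock(&existence_lock);
    /* unlink np here: no new thread can now acquire np->lock */
    pthread_mutex_unlock(&existence_lock);
    pthread_mutex_lock(&np->lock);   /* wait out the last holder */
    pthread_mutex_unlock(&np->lock);
    free(np);
  }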
For more on providing existence guarantees, see Chapters and .
Quick Quiz .12:
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in
Figure
have efficiency of exactly 1.0 when running on only one thread?
Answer:
The matmul.c program creates the specified number of
worker threads, so even the single-worker-thread case incurs
thread-creation overhead.
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise to the reader.
Quick Quiz .13:
How are data-parallel techniques going to help with matrix
multiply?
It is already data parallel!!!
Answer:
I am glad that you are paying attention!
This example serves to show that although data parallelism can
be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency.
Linear scaling at full performance, even to ``only'' 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the size of the partitions. For example, if you split a 64-by-64 matrix multiply across 64 threads, each thread gets only 64 floating-point multiplies. The cost of a floating-point multiply is minuscule compared to the overhead of thread creation.
Moral: If you have a parallel program with variable input, always include a check for the input size being too small to be worth parallelizing. And when it is not helpful to parallelize, it is not helpful to spawn a single thread, now is it?
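A minimal sketch of such a check might look as follows, with the threshold value and function names being illustrative assumptions rather than code from matmul.c:

  /* Fall back to sequential execution when the matrix is too
   * small for parallelization to pay off. */
  #define PAR_THRESHOLD 256  /* tune for the system at hand */

  void matmul_seq(int n, const double *a, const double *b,
                  double *c);
  void matmul_par(int n, const double *a, const double *b,
                  double *c, int nthreads);

  void matmul(int n, const double *a, const double *b, double *c,
              int nthreads)
  {
    if (n < PAR_THRESHOLD || nthreads == 1)
      matmul_seq(n, a, b, c);  /* no thread-creation overhead */
    else
      matmul_par(n, a, b, c, nthreads);
  }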
Quick Quiz .14:
In what situation would hierarchical locking work well?
Answer:
If the comparison on line 31 of
Figure
were replaced by a much heavier-weight operation,
then releasing bp->bucket_lock might reduce lock
contention enough to outweigh the overhead of the extra
acquisition and release of cur->node_lock.
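The resulting code might look something like the following sketch, in which the structure layout and heavy_compare() are illustrative assumptions. Note also that it assumes nodes are never freed, an existence guarantee of the kind discussed in Quick Quiz .11:

  #include <pthread.h>
  #include <stddef.h>

  struct node {
    pthread_mutex_t node_lock;
    struct node *next;
    /* key and data fields omitted */
  };

  struct bucket {
    pthread_mutex_t bucket_lock;
    struct node *head;
  };

  int heavy_compare(struct node *np, const void *key);  /* assumed */

  /* Returns the matching node with its node_lock held, or NULL. */
  struct node *search_bucket(struct bucket *bp, const void *key)
  {
    struct node *cur;

    pthread_mutex_lock(&bp->bucket_lock);
    cur = bp->head;
    while (cur != NULL) {
      pthread_mutex_lock(&cur->node_lock);
      pthread_mutex_unlock(&bp->bucket_lock);  /* cut contention */
      if (heavy_compare(cur, key))
        return cur;
      pthread_mutex_unlock(&cur->node_lock);
      pthread_mutex_lock(&bp->bucket_lock);  /* see note above */
      cur = cur->next;
    }
    pthread_mutex_unlock(&bp->bucket_lock);
    return NULL;
  }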
Quick Quiz .15:
In Figure ,
there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths
10, 11, and 12.
Why?
Answer:
This is due to the per-CPU target value being three.
A run length of 12 must acquire the global-pool lock twice,
while a run length of 13 must acquire the global-pool lock
three times.
Quick Quiz .16:
Allocation failures were observed in the two-thread
tests at run lengths of 19 and greater.
Given the global-pool size of 40 and the per-CPU target
pool size of three, what is the smallest allocation run
length at which failures can occur?
Answer:
The exact solution to this problem is left as an exercise to
the reader.
The first solution received will be credited to its submitter.
As a rough rule of thumb, the global pool size should be at least
m + 2sn, where
``m'' is the maximum number of elements allocated at a given time,
``s'' is the per-CPU pool size,
and ``n'' is the number of CPUs.
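For example, with the per-CPU pool size s = 3 and taking n = 2 for the two-thread tests above, this rule of thumb calls for a global pool of at least m + 12 elements.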