Please keep in mind that the title of this book is ``Is Parallel Programming Hard, And, If So, What Can You Do About It?''. One of the most effective things you can do about it is to avoid forgetting the simple stuff! After all, if you choose to do parallel programming the hard way, you have no one but yourself to blame for it being hard.
Quick Quiz .2:
Is there a simpler way to create a parallel shell script?
If so, how? If not, why not?
Answer:
One straightforward approach is the shell pipeline:
grep $pattern1 | sed -e 's/a/b/' | sort
Quick Quiz .3:
But if script-based parallel programming is so easy, why
bother with anything else?
Answer:
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based.
However, script-based parallelism does have its limitations, one of
which is the rather coarse grain of the parallelism it provides.
Those requiring finer-grained parallelism are well advised to
think hard about their problem to see if it can be expressed
in a coarse-grained form.
If not, they should consider using other parallel-programming
environments, such as those discussed in
Section .
Quick Quiz .4:
Why does this wait() primitive need to be so complicated?
Why not just make it work like the shell-script wait does?
Answer:
Some parallel applications need to take special action when
specific children exit, and therefore need to wait for each
child individually.
In addition, some parallel applications need to detect the
reason that the child died.
As we saw in Figure ,
it is not hard to build a waitall() function out of
the wait() function, but it would be impossible to
do the reverse.
Once the information about a specific child is lost, it is lost.
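For reference, a minimal sketch of such a waitall(), which simply
loops until wait() reports that no children remain, might look like
this (details such as the error handling may differ from the figure):

#include <sys/wait.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Wait for all children: loop until wait() says none remain. */
static void waitall(void)
{
  int pid;
  int status;

  for (;;) {
    pid = wait(&status);
    if (pid == -1) {
      if (errno == ECHILD)
        break;  /* No more children, so we are done. */
      perror("wait");
      exit(EXIT_FAILURE);
    }
  }
}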
Quick Quiz .5:
Isn't there a lot more to fork() and wait()
than discussed here?
Answer:
Indeed there is, and
it is quite possible that this section will be expanded in
future versions to include messaging features (such as UNIX
pipes, TCP/IP, and shared file I/O) and memory mapping
(such as mmap() and shmget()).
In the meantime, any number of textbooks cover these primitives
in great detail, and the truly motivated can read the manpages,
study existing parallel applications that use these primitives,
and examine the source code of the Linux-kernel implementations
themselves.
Quick Quiz .6:
If the mythread() function in
Figure
can simply return, why bother with pthread_exit()?
Answer:
In this simple example, there is no reason whatsoever.
However, imagine a more complex example, where mythread()
invokes other functions, possibly separately compiled.
In such a case, pthread_exit() allows these other functions
to end the thread's execution without having to pass some sort
of error return all the way back up to mythread().
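As a purely hypothetical illustration (do_work() does not appear in
the figure), a deeply nested helper might terminate the thread
directly as follows:

#include <pthread.h>

/* Hypothetical helper, perhaps in a separately compiled file. */
static void do_work(void *arg)
{
  if (arg == NULL)
    pthread_exit(NULL);  /* Terminate this thread right here. */
  /* ... normal processing ... */
}

static void *mythread(void *arg)
{
  do_work(arg);
  return NULL;  /* Reached only if do_work() did not exit the thread. */
}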
Quick Quiz .7:
If the C language makes no guarantees in the presence of a data
race, then why does the Linux kernel have so many data races?
Are you trying to tell me that the Linux kernel is completely
broken???
Answer:
Ah, but the Linux kernel is written in a carefully selected
superset of the C language that includes special gcc
extensions, such as asms, that permit safe execution even
in the presence of data races.
In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic.
For an example, consider embedded systems with 32-bit pointers
and 16-bit busses.
On such a system, a data race involving a store to and a load
from a given pointer might well result in the load returning the
low-order 16 bits of the old value of the pointer concatenated
with the high-order 16 bits of the new value of the pointer.
Quick Quiz .8:
What if I want several threads to hold the same lock at the
same time?
Answer:
The first thing you should do is to ask yourself why you would
want to do such a thing.
If the answer is ``because I have a lot of data that is read
by many threads, and only occasionally updated'', then
POSIX reader-writer locks might be what you are looking for.
These are introduced in
Section .
Another way to get the effect of multiple threads holding the same lock is for one thread to acquire the lock, and then use pthread_create() to create the other threads. The question of why this would ever be a good idea is left to the reader.
Quick Quiz .9:
Why not simply make the argument to lock_reader()
on line 5 of
Figure
be a pointer to a pthread_mutex_t?
Answer:
Because we will need to pass lock_reader() to
pthread_create().
Although we could cast the function when passing it to
pthread_create(), function casts are quite a bit
uglier and harder to get right than are simple pointer casts.
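The pointer cast therefore happens inside lock_reader() itself,
roughly as follows (a sketch that omits the error checking shown
in the figure):

#include <pthread.h>

/* pthread_create() requires this signature, so the lock arrives
   as a void pointer and is cast back inside the function. */
void *lock_reader(void *arg)
{
  pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;

  pthread_mutex_lock(pmlp);
  /* ... read the data protected by *pmlp ... */
  pthread_mutex_unlock(pmlp);
  return NULL;
}

/* Illustrative usage: pthread_create(&tid, NULL, lock_reader, &lock); */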
Quick Quiz .10:
Writing four lines of code for each acquisition and release
of a pthread_mutex_t sure seems painful!
Isn't there a better way?
Answer:
Indeed!
And for that reason, the pthread_mutex_lock() and
pthread_mutex_unlock() primitives are normally wrapped
in functions that do this error checking.
Later on, we will wrap them in the Linux-kernel-style
spin_lock() and spin_unlock() APIs.
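A minimal sketch of such error-checking wrappers, here already given
the spin_lock() and spin_unlock() names (the abort-on-error policy is
illustrative, and this book's actual wrappers may differ in detail),
might look like this:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Acquire the lock, aborting on error rather than returning a code. */
static void spin_lock(pthread_mutex_t *mp)
{
  int en = pthread_mutex_lock(mp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(en));
    abort();
  }
}

/* Release the lock, again aborting on error. */
static void spin_unlock(pthread_mutex_t *mp)
{
  int en = pthread_mutex_unlock(mp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(en));
    abort();
  }
}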
Quick Quiz .11:
Is ``x = 0'' the only possible output from the code fragment
shown in
Figure ?
If so, why?
If not, what other output could appear, and why?
Answer:
No.
The reason that ``x = 0'' was output was that lock_reader()
acquired the lock first.
Had lock_writer() instead acquired the lock first, then
the output would have been ``x = 3''.
However, because the code fragment started lock_reader() first
and because this run was performed on a multiprocessor,
one would normally expect lock_reader() to acquire the
lock first.
However, there are no guarantees, especially on a busy system.
Quick Quiz .12:
Using different locks could cause quite a bit of confusion,
what with threads seeing each other's intermediate states.
So should well-written parallel programs restrict themselves
to using a single lock in order to avoid this kind of confusion?
Answer:
Although it is sometimes possible to write a program using a
single global lock that both performs and scales well, such
programs are exceptions to the rule.
You will normally need to use multiple locks to attain good
performance and scalability.
One possible exception to this rule is ``transactional memory'', which is currently a research topic. Transactional-memory semantics can be thought of as those of a single global lock with optimizations permitted and with the addition of rollback [Boe09].
Quick Quiz .13:
In the code shown in
Figure ,
is lock_reader() guaranteed to see all the values produced
by lock_writer()?
Why or why not?
Answer:
No.
On a busy system, lock_reader() might be preempted
for the entire duration of lock_writer()'s execution,
in which case it would not see any of lock_writer()'s
intermediate states for x.
Quick Quiz .14:
Wait a minute here!!!
Figure
didn't initialize shared variable x,
so why does it need to be initialized in
Figure
?
Answer:
See line 3 of
Figure .
Because the code in
Figure
ran first, it could rely on the compile-time initialization of
x.
The code in
Figure
ran next, so it had to re-initialize x.
Quick Quiz .15:
Isn't comparing against single-CPU throughput a bit harsh?
Answer:
Not at all.
In fact, this comparison was, if anything, overly lenient.
A more balanced comparison would be against single-CPU
throughput with the locking primitives commented out.
Quick Quiz .16:
But 1,000 instructions is not a particularly small size for
a critical section.
What do I do if I need a much smaller critical section, for
example, one containing only a few tens of instructions?
Answer:
If the data being read never changes, then you do not
need to hold any locks while accessing it.
If the data changes sufficiently infrequently, you might be
able to checkpoint execution, terminate all threads, change
the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per thread, so that a thread read-acquires the larger aggregate reader-writer lock by acquiring its own lock, and write-acquires by acquiring all the per-thread locks [HW92]. This can work quite well for readers, but causes writers to incur increasingly large overheads as the number of threads increases.
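A minimal sketch of this approach, with illustrative names and a
fixed maximum number of threads, might look as follows:

#include <pthread.h>

#define MAX_THREADS 64  /* Illustrative bound. */

static pthread_mutex_t per_thread_lock[MAX_THREADS];

static void my_rwlock_init(void)
{
  int i;

  for (i = 0; i < MAX_THREADS; i++)
    pthread_mutex_init(&per_thread_lock[i], NULL);
}

/* Readers acquire only their own lock, avoiding cross-thread contention. */
static void my_read_lock(int me)
{
  pthread_mutex_lock(&per_thread_lock[me]);
}

static void my_read_unlock(int me)
{
  pthread_mutex_unlock(&per_thread_lock[me]);
}

/* Writers acquire every lock, always in the same order to avoid
   deadlock, so their overhead grows with the number of threads. */
static void my_write_lock(void)
{
  int i;

  for (i = 0; i < MAX_THREADS; i++)
    pthread_mutex_lock(&per_thread_lock[i]);
}

static void my_write_unlock(void)
{
  int i;

  for (i = 0; i < MAX_THREADS; i++)
    pthread_mutex_unlock(&per_thread_lock[i]);
}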
Some other ways of handling very small critical sections are
described in Section .
Quick Quiz .17:
In
Figure ,
all of the traces other than the 100M trace deviate gently
from the ideal line.
In contrast, the 100M trace breaks sharply from the ideal
line at 64 CPUs.
In addition, the spacing between the 100M trace and the 10M
trace is much smaller than that between the 10M trace and the
1M trace.
Why does the 100M trace behave so differently from the
other traces?
Answer:
Your first clue is that 64 CPUs is exactly half of the 128
CPUs on the machine.
The difference is an artifact of hardware threading.
This system has 64 cores with two hardware threads per core.
As long as fewer than 64 threads are running, each can run
in its own core.
But as soon as there are more than 64 threads, some of the threads
must share cores.
Because the pair of threads in any given core share some hardware
resources, the throughput of two threads sharing a core is not
quite as high as that of two threads each in their own core.
So the performance of the 100M trace is limited not by the
reader-writer lock, but rather by the sharing of hardware resources
between hardware threads in a single core.
This can also be seen in the 10M trace, which deviates gently from the ideal line up to 64 threads, then breaks sharply down, parallel to the 100M trace. Up to 64 threads, the 10M trace is limited primarily by reader-writer lock scalability, and beyond that, also by sharing of hardware resources between hardware threads in a single core.
Quick Quiz .18:
Power 5 is several years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
Answer:
In general, newer hardware is improving.
However, it will need to improve by more than two orders of magnitude
to permit reader-writer locking to achieve ideal performance on
128 CPUs.
Worse yet, the greater the number of CPUs, the larger the
required performance improvement.
The performance problems of reader-writer locking are therefore
very likely to be with us for quite some time to come.
Quick Quiz .19:
Is it really necessary to have both sets of primitives?
Answer:
Strictly speaking, no.
One could implement any member of the second set using the
corresponding member of the first set.
For example, one could implement __sync_nand_and_fetch()
in terms of __sync_fetch_and_nand() as follows:
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~(ret & tmp);
It is similarly possible to implement __sync_fetch_and_add(), __sync_fetch_and_sub(), and __sync_fetch_and_xor() in terms of their post-value counterparts.
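For example, a sketch of __sync_fetch_and_add() in terms of
__sync_add_and_fetch() simply subtracts the addend from the returned
post-addition value (my_fetch_and_add() is an illustrative name, not
a real primitive):

/* Recover the pre-addition value from the post-addition primitive. */
static unsigned long my_fetch_and_add(unsigned long *p, unsigned long v)
{
  return __sync_add_and_fetch(p, v) - v;
}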
However, the alternative forms can be quite convenient, both for the programmer and for the compiler/library implementor.
Quick Quiz .20:
Given that these atomic operations will often be able to
generate single atomic instructions that are directly
supported by the underlying instruction set, shouldn't
they be the fastest possible way to get things done?
Answer:
Unfortunately, no.
See Chapter for some stark counterexamples.
Quick Quiz .21:
What happened to the Linux-kernel equivalents to fork()
and join()?
Answer:
They don't really exist.
All tasks executing within the Linux kernel share memory,
at least unless you want to do a huge amount of memory-mapping
work by hand.