F.3 Chapter [*]

Quick Quiz [*].1: 
But this silly shell script isn't a real parallel program! Why bother with such trivia???
 
Answer:
Because you should never forget the simple stuff!

Please keep in mind that the title of this book is ``Is Parallel Programming Hard, And, If So, What Can You Do About It?''. One of the most effective things you can do about it is to avoid forgetting the simple stuff! After all, if you choose to do parallel programming the hard way, you have no one but yourself to blame for it being hard.

Quick Quiz [*].2: 
Is there a simpler way to create a parallel shell script? If so, how? If not, why not?
 
Answer:
One straightforward approach is the shell pipeline:

grep $pattern1 | sed -e 's/a/b/' | sort


For a sufficiently large input file, grep will pattern-match in parallel with sed editing and with the input processing of sort. See the file parallel.sh for a demonstration of shell-script parallelism and pipelining.

Quick Quiz [*].3: 
But if script-based parallel programming is so easy, why bother with anything else?
 
Answer:
In fact, it is quite likely that a very large fraction of parallel programs in use today are script-based. However, script-based parallelism does have its limitations:

  1. Creation of new processes is usually quite heavyweight, involving the expensive fork() and exec() system calls.
  2. Sharing of data, including pipelining, typically involves expensive file I/O.
  3. The reliable synchronization primitives available to scripts also typically involve expensive file I/O.

These limitations mean that script-based parallelism must be coarse-grained, with each unit of work having an execution time of at least tens of milliseconds, and preferably much longer.

Those requiring finer-grained parallelism are well advised to think hard about their problem to see if it can be expressed in a coarse-grained form. If not, they should consider using other parallel-programming environments, such as those discussed in Section [*].

Quick Quiz [*].4: 
Why does this wait() primitive need to be so complicated? Why not just make it work like the shell-script wait does?
 
Answer:
Some parallel applications need to take special action when specific children exit, and therefore need to wait for each child individually. In addition, some parallel applications need to detect the reason that the child died. As we saw in Figure [*], it is not hard to build a waitall() function out of the wait() function, but it would be impossible to do the reverse. Once the information about a specific child is lost, it is lost.
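
For example, here is a minimal sketch of a waitall() built on top of wait(), in the spirit of the figure (the error handling is illustrative, not the book's exact code):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Wait for all of this process's children to exit. */
static void waitall(void)
{
	int status;

	for (;;) {
		if (wait(&status) == -1) {
			if (errno == ECHILD)
				break;	/* No children remain. */
			perror("wait");
			exit(EXIT_FAILURE);
		}
		/* Each child's PID and status is available here, then discarded. */
	}
}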

Quick Quiz [*].5: 
Isn't there a lot more to fork() and wait() than discussed here?
 
Answer:
Indeed there is, and it is quite possible that this section will be expanded in future versions to include messaging features (such as UNIX pipes, TCP/IP, and shared file I/O) and memory mapping (such as mmap() and shmget()). In the meantime, any number of textbooks cover these primitives in great detail, and the truly motivated can read the manpages, study existing parallel applications that use these primitives, and examine the source code of the Linux kernel's implementations themselves.

Quick Quiz [*].6: 
If the mythread() function in Figure [*] can simply return, why bother with pthread_exit()?
 
Answer:
In this simple example, there is no reason whatsoever. However, imagine a more complex example, where mythread() invokes other functions, possibly separately compiled. In such a case, pthread_exit() allows these other functions to end the thread's execution without having to pass some sort of error return all the way back up to mythread().
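
As an illustrative sketch (the helper function and its error condition are assumptions, not the book's code):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Imagine this helper living in a separately compiled file. */
static void do_work(intptr_t arg)
{
	if (arg < 0) {
		fprintf(stderr, "bad argument\n");
		pthread_exit(NULL);	/* Ends the thread right here. */
	}
	/* ... normal processing ... */
}

void *mythread(void *arg)
{
	do_work((intptr_t)arg);
	return NULL;	/* Reached only if do_work() returns. */
}

Without pthread_exit(), do_work() would instead have to return an error indication that mythread() would then have to check and propagate.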

Quick Quiz [*].7: 
If the C language makes no guarantees in the presence of a data race, then why does the Linux kernel have so many data races? Are you trying to tell me that the Linux kernel is completely broken???
 
Answer:
Ah, but the Linux kernel is written in a carefully selected superset of the C language that includes special gcc extensions, such as asms, that permit safe execution even in the presence of data races. In addition, the Linux kernel does not run on a number of platforms where data races would be especially problematic. For example, consider embedded systems with 32-bit pointers and 16-bit busses. On such a system, a data race involving a store to and a load from a given pointer might well result in the load returning the low-order 16 bits of the old value of the pointer concatenated with the high-order 16 bits of the new value of the pointer.

Quick Quiz [*].8: 
What if I want several threads to hold the same lock at the same time?
 
Answer:
The first thing you should do is to ask yourself why you would want to do such a thing. If the answer is ``because I have a lot of data that is read by many threads, and only occasionally updated'', then POSIX reader-writer locks might be what you are looking for. These are introduced in Section [*].
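
As a preview, here is a minimal sketch of that read-mostly pattern using POSIX reader-writer locks (the variable names are illustrative):

#include <pthread.h>

static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
static int shared_data;

static int reader(void)
{
	int val;

	pthread_rwlock_rdlock(&rwl);	/* Many readers may hold this at once. */
	val = shared_data;
	pthread_rwlock_unlock(&rwl);
	return val;
}

static void writer(int val)
{
	pthread_rwlock_wrlock(&rwl);	/* Writers get exclusive access. */
	shared_data = val;
	pthread_rwlock_unlock(&rwl);
}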

Another way to get the effect of multiple threads holding the same lock is for one thread to acquire the lock, and then use pthread_create() to create the other threads. The question of why this would ever be a good idea is left to the reader.

Quick Quiz [*].9: 
Why not simply make the argument to lock_reader() on line 5 of Figure [*] be a pointer to a pthread_mutex_t?
 
Answer:
Because we will need to pass lock_reader() to pthread_create(). Although we could cast the function when passing it to pthread_create(), function casts are quite a bit uglier and harder to get right than are simple pointer casts.
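
The point is that pthread_create() expects a start routine of type void *(*)(void *), so the pointer cast is best done inside the function, roughly as follows (a sketch, not the figure's exact code):

#include <pthread.h>

void *lock_reader(void *arg)
{
	pthread_mutex_t *lock = (pthread_mutex_t *)arg;

	/* ... acquire *lock, read the shared data, release *lock ... */
	return NULL;
}

/* Typed this way, no function cast is needed:
 *	pthread_create(&tid, NULL, lock_reader, &mylock);
 */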

Quick Quiz [*].10: 
Writing four lines of code for each acquisition and release of a pthread_mutex_t sure seems painful! Isn't there a better way?
 
Answer:
Indeed! And for that reason, the pthread_mutex_lock() and pthread_mutex_unlock() primitives are normally wrapped in functions that do this error checking. Later on, we will wrap them to match the Linux-kernel spin_lock() and spin_unlock() APIs.
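
One plausible form of such a wrapper, as a sketch that simply treats any error as fatal:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void spin_lock(pthread_mutex_t *lp)
{
	int ret = pthread_mutex_lock(lp);

	if (ret != 0) {
		fprintf(stderr, "pthread_mutex_lock: %s\n", strerror(ret));
		abort();
	}
}

static void spin_unlock(pthread_mutex_t *lp)
{
	int ret = pthread_mutex_unlock(lp);

	if (ret != 0) {
		fprintf(stderr, "pthread_mutex_unlock: %s\n", strerror(ret));
		abort();
	}
}

Each acquisition and release is then once again a single line of code.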

Quick Quiz [*].11: 
Is ``x = 0'' the only possible output from the code fragment shown in Figure [*]? If so, why? If not, what other output could appear, and why?
 
Answer:
No. The reason that ``x = 0'' was output was that lock_reader() acquired the lock first. Had lock_writer() instead acquired the lock first, then the output would have been ``x = 3''. Because the code fragment started lock_reader() first, and because this run was performed on a multiprocessor, one would normally expect lock_reader() to acquire the lock first. However, there are no guarantees, especially on a busy system.

Quick Quiz [*].12: 
Using different locks could cause quite a bit of confusion, what with threads seeing each other's intermediate states. So should well-written parallel programs restrict themselves to using a single lock in order to avoid this kind of confusion?
 
Answer:
Although it is sometimes possible to write a program using a single global lock that both performs and scales well, such programs are exceptions to the rule. You will normally need to use multiple locks to attain good performance and scalability.

One possible exception to this rule is ``transactional memory'', which is currently a research topic. Transactional-memory semantics can be thought of as those of a single global lock with optimizations permitted and with the addition of rollback [Boe09].

Quick Quiz [*].13: 
In the code shown in Figure [*], is lock_reader() guaranteed to see all the values produced by lock_writer()? Why or why not?
 
Answer:
No. On a busy system, lock_reader() might be preempted for the entire duration of lock_writer()'s execution, in which case it would not see any of lock_writer()'s intermediate states for x.

Quick Quiz [*].14: 
Wait a minute here!!! Figure [*] didn't initialize shared variable x, so why does it need to be initialized in Figure [*]?
 
Answer:
See line 3 of Figure [*]. Because the code in Figure [*] ran first, it could rely on the compile-time initialization of x. The code in Figure [*] ran next, so it had to re-initialize x.

Quick Quiz [*].15: 
Isn't comparing against single-CPU throughput a bit harsh?
 
Answer:
Not at all. In fact, this comparison was, if anything, overly lenient. A more balanced comparison would be against single-CPU throughput with the locking primitives commented out.

Quick Quiz [*].16: 
But 1,000 instructions is not a particularly small size for a critical section. What do I do if I need a much smaller critical section, for example, one containing only a few tens of instructions?
 
Answer:
If the data being read never changes, then you do not need to hold any locks while accessing it. If the data changes sufficiently infrequently, you might be able to checkpoint execution, terminate all threads, change the data, then restart at the checkpoint.

Another approach is to keep a single exclusive lock per thread, so that a thread read-acquires the larger aggregate reader-writer lock by acquiring its own lock, and write-acquires by acquiring all the per-thread locks [HW92]. This can work quite well for readers, but causes writers to incur increasingly large overheads as the number of threads increases.
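
A sketch of this per-thread-lock technique, in which NR_THREADS and the per-thread index passed by the caller are illustrative assumptions:

#include <pthread.h>

#define NR_THREADS 128	/* Illustrative maximum. */

static pthread_mutex_t perthreadlock[NR_THREADS] = {
	[0 ... NR_THREADS - 1] = PTHREAD_MUTEX_INITIALIZER, /* gcc extension */
};

/* A reader acquires only its own lock... */
static void read_acquire(int me)
{
	pthread_mutex_lock(&perthreadlock[me]);
}

static void read_release(int me)
{
	pthread_mutex_unlock(&perthreadlock[me]);
}

/* ...while a writer must acquire all of them, in order. */
static void write_acquire(void)
{
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_mutex_lock(&perthreadlock[i]);
}

static void write_release(void)
{
	int i;

	for (i = NR_THREADS - 1; i >= 0; i--)
		pthread_mutex_unlock(&perthreadlock[i]);
}

Readers never contend with each other, but each write acquisition touches every per-thread lock, which is exactly the writer overhead called out above.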

Some other ways of handling very small critical sections are described in Section [*].

Quick Quiz [*].17: 
In Figure [*], all of the traces other than the 100M trace deviate gently from the ideal line. In contrast, the 100M trace breaks sharply from the ideal line at 64 CPUs. In addition, the spacing between the 100M trace and the 10M trace is much smaller than that between the 10M trace and the 1M trace. Why does the 100M trace behave so much differently than the other traces?
 
Answer:
Your first clue is that 64 CPUs is exactly half of the 128 CPUs on the machine. The difference is an artifact of hardware threading. This system has 64 cores with two hardware threads per core. As long as there are no more than 64 threads running, each can run in its own core. But as soon as there are more than 64 threads, some of them must share cores. Because the pair of threads in any given core share some hardware resources, the throughput of two threads sharing a core is not quite as high as that of two threads each in their own core. So the performance of the 100M trace is limited not by the reader-writer lock, but rather by the sharing of hardware resources between hardware threads in a single core.

This can also be seen in the 10M trace, which deviates gently from the ideal line up to 64 threads, then breaks sharply down, parallel to the 100M trace. Up to 64 threads, the 10M trace is limited primarily by reader-writer lock scalability, and beyond that, also by sharing of hardware resources between hardware threads in a single core.

Quick Quiz [*].18: 
Power 5 is several years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow?
 
Answer:
In general, newer hardware is improving. However, it will need to improve by more than two orders of magnitude to permit reader-writer locking to achieve ideal performance on 128 CPUs. Worse yet, the greater the number of CPUs, the larger the required performance improvement. The performance problems of reader-writer locking are therefore very likely to be with us for quite some time to come.

Quick Quiz [*].19: 
Is it really necessary to have both sets of primitives?
 
Answer:
Strictly speaking, no. One could implement any member of the second set using the corresponding member of the first set. For example, one could implement __sync_nand_and_fetch() in terms of __sync_fetch_and_nand() as follows:



tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~(ret & tmp); /* NAND semantics are ~(old & v) as of gcc 4.4. */


It is similarly possible to implement __sync_fetch_and_add(), __sync_fetch_and_sub(), and __sync_fetch_and_xor() in terms of their post-value counterparts.
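
For example, the pre-addition value can be recovered from __sync_add_and_fetch() by simply undoing the addition (a sketch, with p and v as before):

ret = __sync_add_and_fetch(p, v);
ret = ret - v; /* Subtract to recover the pre-addition value. */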

However, the alternative forms can be quite convenient, both for the programmer and for the compiler/library implementor.

Quick Quiz [*].20: 
Given that these atomic operations will often be able to generate single atomic instructions that are directly supported by the underlying instruction set, shouldn't they be the fastest possible way to get things done?
 
Answer:
Unfortunately, no. See Chapter [*] for some stark counterexamples.

Quick Quiz [*].21: 
What happened to the Linux-kernel equivalents to fork() and join()?
 
Answer:
They don't really exist. All tasks executing within the Linux kernel share memory, at least unless you want to do a huge amount of memory-mapping work by hand.
