F.4 Chapter [*]

Quick Quiz [*].1: 
Why on earth should efficient and scalable counting be hard? After all, computers have special hardware for the sole purpose of doing counting, addition, subtraction, and lots more besides, don't they???
 
Answer:
Because the straightforward counting algorithms, for example, atomic operations on a shared counter, are slow and scale badly, as will be seen in Section [*].
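
As a minimal sketch of such a straightforward algorithm (using the gcc __sync_fetch_and_add() built-in; the names are illustrative):

  unsigned long counter = 0;

  static void inc_count(void)
  {
    /* Atomic, but every CPU fights over the cache line
     * holding counter, which is what makes this approach
     * slow and unscalable. */
    __sync_fetch_and_add(&counter, 1);
  }

  static unsigned long read_count(void)
  {
    return counter;  /* a single load suffices to read out */
  }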

Quick Quiz [*].2: 
Network-packet counting problem. Suppose that you need to collect statistics on the number of networking packets (or total number of bytes) transmitted and/or received. Packets might be transmitted or received by any CPU on the system. Suppose further that this large machine is capable of handling a million packets per second, and that there is a systems-monitoring package that reads out the count every five seconds. How would you implement this statistical counter?
 
Answer:
Hint: the act of updating the counter must be blazingly fast, but because the counter is read out only about once in five million updates, the act of reading out the counter can be quite slow. In addition, the value read out normally need not be all that accurate--after all, since the counter is updated a thousand times per millisecond, we should be able to work with a value that is within a few thousand counts of the ``true value'', whatever ``true value'' might mean in this context. However, the value read out should maintain roughly the same absolute error over time. For example, a 1% error might be just fine when the count is on the order of a million or so, but might be absolutely unacceptable once the count reaches a trillion. See Section [*].

Quick Quiz [*].3: 
Approximate structure-allocation limit problem. Suppose that you need to maintain a count of the number of structures allocated in order to fail any allocations once the number of structures in use exceeds a limit (say, 10,000). Suppose further that these structures are short-lived, that the limit is rarely exceeded, and that a ``sloppy'' approximate limit is acceptable.
 
Answer:
Hint: the act of updating the counter must be blazingly fast, but the counter is read out each time that the counter is increased. However, the value read out need not be accurate except that it absolutely must distinguish perfectly between values below the limit and values greater than or equal to the limit. See Section [*].

Quick Quiz [*].4: 
Exact structure-allocation limit problem. Suppose that you need to maintain a count of the number of structures allocated in order to fail any allocations once the number of structures in use exceeds an exact limit (say, 10,000). Suppose further that these structures are short-lived, that the limit is rarely exceeded, that there is almost always at least one structure in use, and that it is necessary to know exactly when this counter reaches zero, for example, in order to free up some memory that is not required unless there is at least one structure in use.
 
Answer:
Hint: the act of updating the counter must be blazingly fast, but the counter is read out each time that the counter is increased. However, the value read out need not be accurate except that it absolutely must distinguish perfectly between values between the limit and zero on the one hand, and values that either are less than or equal to zero or are greater than or equal to the limit on the other hand. See Section [*].

Quick Quiz [*].5: 
Removable I/O device access-count problem. Suppose that you need to maintain a reference count on a heavily used removable mass-storage device, so that you can tell the user when it is safe to remove the device. This device follows the usual removal procedure where the user indicates a desire to remove the device, and the system tells the user when it is safe to do so.
 
Answer:
Hint: the act of updating the counter must be blazingly fast and scalable in order to avoid slowing down I/O operations, but because the counter is read out only when the user wishes to remove the device, the counter read-out operation can be extremely slow. Furthermore, there is no need to be able to read out the counter at all unless the user has already indicated a desire to remove the device. In addition, the value read out need not be accurate except that it absolutely must distinguish perfectly between non-zero and zero values. However, once it has read out a zero value, it must act to keep the value at zero until it has taken some action to prevent subsequent threads from gaining access to the device being removed. See Section [*].

Quick Quiz [*].6: 
But doesn't the ++ operator produce an x86 add-to-memory instruction? And won't the CPU cache cause this to be atomic?
 
Answer:
Although the ++ operator could be atomic, there is no requirement that it be so. Furthermore, the ACCESS_ONCE() primitive forces most versions of gcc to load the value into a register, increment the register, then store the value back to memory, which is decidedly non-atomic.
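
For example, ACCESS_ONCE(counter)++ typically expands to a sequence equivalent to the following sketch, so that two CPUs can both load the value 5 and both store 6, losing one of the two increments:

  long tmp;

  tmp = ACCESS_ONCE(counter);  /* load counter into a register */
  tmp++;                       /* increment the register */
  ACCESS_ONCE(counter) = tmp;  /* store the result back to memory */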

Quick Quiz [*].7: 
The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection?
 
Answer:
There are no trivial parallel programs, and most days I am not so sure that there are trivial sequential programs, either.

No matter how small or simple the program, if you haven't tested it, it does not work. And even if you have tested it, Murphy says there are at least a few bugs still lurking.

Furthermore, while proofs of correctness certainly do have their place, they never will replace testing, including the counttorture.h test setup used here. After all, proofs can have bugs just as easily as can programs!

Quick Quiz [*].8: 
Why doesn't the dashed line on the x axis meet the diagonal line at $y=1$?
 
Answer:
Because of the overhead of the atomic operation. The dashed line on the x axis represents the overhead of a single non-atomic increment. After all, an ideal algorithm would not only scale linearly, it would also incur no performance penalty compared to single-threaded code.

This level of idealism may seem severe, but if it is good enough for Linus Torvalds, it is good enough for you.

Quick Quiz [*].9: 
But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me; after all, most of the program's execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster?
 
Answer:
In many cases, atomic increment will in fact be fast enough for you. In those cases, you should by all means use atomic increment. That said, there are many real-world situations where more elaborate counting algorithms are required. The canonical example of such a situation is counting packets and bytes in highly optimized networking stacks, where it is all too easy to find much of the execution time going into these sorts of accounting tasks, especially on large multiprocessors.

In addition, counting provides an excellent view of the issues encountered in shared-memory parallel programs.

Quick Quiz [*].10: 
But why can't CPU designers simply ship the operation to the data, avoiding the need to circulate the cache line containing the global variable being incremented?
 
Answer:
It might well be possible to do this in some cases. However, there are a few complications:

  1. If the value of the variable is required, then the thread will be forced to wait for the operation to be shipped to the data, and then for the result to be shipped back.
  2. If the atomic increment must be ordered with respect to prior and/or subsequent operations, then the thread will be forced to wait for the operation to be shipped to the data, and for an indication that the operation completed to be shipped back.
  3. Shipping operations among CPUs will likely require more signals, which will consume more die area and more electrical power.
But what if neither of the first two conditions holds? Then you should think carefully about the algorithms discussed in Section [*], which achieve near-ideal performance on commodity hardware.

Figure: Data Flow For Global Combining-Tree Atomic Increment

If either or both of the first two conditions hold, there is some hope for improvement. One could imagine the hardware implementing a combining tree, so that the increment requests from multiple CPUs are combined by the hardware into a single addition when the combined request reaches the hardware. The hardware could also apply an order to the requests, thus returning to each CPU the return value corresponding to its particular atomic increment. This results in instruction latency that varies as $O(\log N)$, where $N$ is the number of CPUs, as shown in Figure [*].

This is a great improvement over the $O(N)$ performance of current hardware shown in Figure [*], and it is possible that hardware latencies might decrease somewhat if innovations such as three-dimensional fabrication prove practical. Nevertheless, we will see that in some important special cases, software can do much better.

Quick Quiz [*].11: 
But doesn't the fact that C's ``integers'' are limited in size complicate things?
 
Answer:
No, because modulo addition is still commutative and associative, at least as long as you use unsigned integers. Recall that in the C standard, overflow of signed integers results in undefined behavior (never mind the fact that machines that do anything other than wrap on overflow are quite rare these days).

That said, one potential source of additional complexity arises when attempting to gather (say) a 64-bit sum from 32-bit per-thread counters. For the moment, dealing with this added complexity is left as an exercise for the reader.

Quick Quiz [*].12: 
An array??? But doesn't that limit the number of threads?
 
Answer:
It can, and in this toy implementation, it does. But it is not that hard to come up with an alternative implementation that permits an arbitrary number of threads. However, this is left as an exercise for the reader.

Quick Quiz [*].13: 
What other choice does gcc have, anyway???
 
Answer:
According to the C standard, the effects of fetching a variable that might be concurrently modified by some other thread are undefined. It turns out that the C standard really has no other choice, given that C must support (for example) eight-bit architectures which are incapable of atomically loading a long. An upcoming version of the C standard aims to fill this gap, but until then, we depend on the kindness of the gcc developers.

Quick Quiz [*].14: 
How does the per-thread counter variable in Figure [*] get initialized?
 
Answer:
The C standard specifies that the initial value of global variables is zero, unless they are explicitly initialized. So the initial value of all the instances of counter will be zero.

That said, one often takes differences of consecutive reads from statistical counters, in which case the initial value is irrelevant.
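
For example, a minimal monitoring-loop sketch (assuming this chapter's read_count() and the five-second interval from Quick Quiz [*].2):

  long cur;
  long prev = read_count();

  for (;;) {
    poll(NULL, 0, 5000);  /* sleep for five seconds */
    cur = read_count();
    printf("%ld counts since last read-out\n", cur - prev);
    prev = cur;  /* only differences matter, not initial values */
  }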

Quick Quiz [*].15: 
How is the code in Figure [*] supposed to permit more than one counter?
 
Answer:
Indeed, this toy example does not support more than one counter. Modifying it so that it can provide multiple counters is left as an exercise to the reader.

Quick Quiz [*].16: 
Why does inc_count() in Figure [*] need to use atomic instructions?
 
Answer:
If non-atomic instructions were used, counts could be lost.

Quick Quiz [*].17: 
Won't the single global thread in the function eventual() of Figure [*] be just as severe a bottleneck as a global lock would be?
 
Answer:
In this case, no. What will happen instead is that the estimate of the counter value returned by read_count() will become more inaccurate.

Quick Quiz [*].18: 
Won't the estimate returned by read_count() in Figure [*] become increasingly inaccurate as the number of threads rises?
 
Answer:
Yes. If this proves problematic, one fix is to provide multiple eventual() threads, each covering its own subset of the other threads. In even more extreme cases, a tree-like hierarchy of eventual() threads might be required.

Quick Quiz [*].19: 
Why do we need an explicit array to find the other threads' counters? Why doesn't gcc provide a per_thread() interface, similar to the Linux kernel's per_cpu() primitive, to allow threads to more easily access each other's per-thread variables?
 
Answer:
Why indeed?

To be fair, gcc faces some challenges that the Linux kernel gets to ignore. When a user-level thread exits, its per-thread variables all disappear, which complicates the problem of per-thread-variable access, particularly before the advent of user-level RCU. In contrast, in the Linux kernel, when a CPU goes offline, that CPU's per-CPU variables remain mapped and accessible.

Similarly, when a new user-level thread is created, its per-thread variables suddenly come into existence. In contrast, in the Linux kernel, all per-CPU variables are mapped and initialized at boot time, regardless of whether the corresponding CPU exists yet, or indeed, whether the corresponding CPU will ever exist.

A key limitation that the Linux kernel imposes is a compile-time maximum limit on the number of CPUs, namely, CONFIG_NR_CPUS. In contrast, in user space, there is no hard-coded upper limit on the number of threads.

Of course, both environments must deal with dynamically loaded code (dynamic libraries in user space, kernel modules in the Linux kernel), which increases the complexity of per-thread variables in both environments.

These complications make it significantly harder for user-space environments to provide access to other threads' per-thread variables. Nevertheless, such access is highly useful, and it is hoped that it will someday appear.
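
In the meantime, user-space code gets by with an explicit registry, along the lines of the following sketch (NR_THREADS, smp_thread_id(), spin_lock(), and final_mutex follow this book's API conventions, and are assumptions of this sketch):

  long __thread counter = 0;
  long *counterp[NR_THREADS] = { NULL };

  void count_register_thread(void)
  {
    /* Publish this thread's counter so that other threads
     * can find it by scanning the array. */
    spin_lock(&final_mutex);
    counterp[smp_thread_id()] = &counter;
    spin_unlock(&final_mutex);
  }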

Quick Quiz [*].20: 
Why on earth do we need something as heavyweight as a lock guarding the summation in the function read_count() in Figure [*]?
 
Answer:
Remember, when a thread exits, its per-thread variables disappear. Therefore, if we attempt to access a given thread's per-thread variables after that thread exits, we will get a segmentation fault. The lock coordinates summation and thread exit, preventing this scenario.

Of course, we could instead read-acquire a reader-writer lock, but Chapter [*] will introduce even lighter-weight mechanisms for implementing the required coordination.
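
As a sketch of the lock-based coordination (names patterned on this chapter's figures; final_mutex and for_each_thread() are assumed):

  long read_count(void)
  {
    int t;
    long sum = 0;

    spin_lock(&final_mutex);  /* exclude thread exit */
    for_each_thread(t)
      if (counterp[t] != NULL)
        sum += *counterp[t];
    spin_unlock(&final_mutex);
    return sum;
  }

  void count_unregister_thread(void)
  {
    spin_lock(&final_mutex);  /* wait out any in-flight summation */
    counterp[smp_thread_id()] = NULL;
    spin_unlock(&final_mutex);
  }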

Quick Quiz [*].21: 
Why on earth do we need to acquire the lock in count_register_thread() in Figure [*]? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right?
 
Answer:
This lock could in fact be omitted, but better safe than sorry, especially given that this function is executed only at thread startup, and is therefore not on any critical path. Now, if we were testing on machines with thousands of CPUs, we might need to omit the lock, but on machines with ``only'' a hundred or so CPUs, no need to get fancy.

Quick Quiz [*].22: 
Fine, but the Linux kernel doesn't have to acquire a lock when reading out the aggregate value of per-CPU counters. So why should user-space code need to do this???
 
Answer:
Remember, the Linux kernel's per-CPU variables are always accessible, even if the corresponding CPU is offline -- even if the corresponding CPU never existed and never will exist.

Figure: Per-Thread Statistical Counters With Lockless Summation

One workaround is to ensure that each thread sticks around until all threads are finished, as shown in Figure [*]. Analysis of this code is left as an exercise to the reader; however, please note that it does not fit well into the counttorture.h counter-evaluation scheme. (Why not?) Chapter [*] will introduce synchronization mechanisms that handle this situation in a much more graceful manner.

Quick Quiz [*].23: 
What fundamental difference is there between counting packets and counting the total number of bytes in the packets, given that the packets vary in size?
 
Answer:
When counting packets, the counter is only incremented by the value one. On the other hand, when counting bytes, the counter might be incremented by largish numbers.

Why does this matter? Because in the increment-by-one case, the value returned will be exact in the sense that the counter must necessarily have taken on that value at some point in time, even if it is impossible to say precisely when that point occurred. In contrast, when counting bytes, two different threads might return values that are inconsistent with any global ordering of operations.

To see this, suppose that thread 0 adds the value three to its counter, thread 1 adds the value five to its counter, and threads 2 and 3 sum the counters. If the system is ``weakly ordered'' or if the compiler uses aggressive optimizations, thread 2 might find the sum to be three and thread 3 might find the sum to be five. The only possible global orders of the sequence of values of the counter are 0,3,8 and 0,5,8, and neither order is consistent with the results obtained.

If you missed this one, you are not alone. Michael Scott used this question to stump Paul McKenney during Paul's Ph.D. defense.

Quick Quiz [*].24: 
Given that the reader must sum all the threads' counters, this could take a long time given large numbers of threads. Is there any way that the increment operation can remain fast and scalable while allowing readers to also enjoy reasonable performance and scalability?
 
Answer:
One approach would be to maintain a global approximation to the value. Updaters would increment their per-thread variable, but when it reached some predefined limit, atomically add it to a global variable, then zero their per-thread variable. This would permit a tradeoff between average increment overhead and accuracy of the value read out.
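
A minimal sketch of this approach (the flush threshold and the names are illustrative):

  #define FLUSH_LIMIT 1024

  unsigned long global_count = 0;          /* approximate global value */
  unsigned long __thread local_count = 0;  /* per-thread contribution */

  void inc_count(void)
  {
    if (++local_count >= FLUSH_LIMIT) {
      /* Fold the local contribution into the global
       * approximation, then start over locally. */
      __sync_fetch_and_add(&global_count, local_count);
      local_count = 0;
    }
  }

  unsigned long read_count(void)
  {
    /* At most nthreads * (FLUSH_LIMIT - 1) counts behind. */
    return global_count;
  }

Larger values of FLUSH_LIMIT reduce the average increment overhead at the expense of read-out accuracy, and vice versa.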

The reader is encouraged to think up and try out other approaches, for example, using a combining tree.

Quick Quiz [*].25: 
What is with the strange form of the condition on line 3 of Figure [*]? Why not the following more intuitive form of the fastpath?

  3 if (counter + delta <= countermax) {
  4   counter += delta;
  5   return 1;
  6 }
 
Answer:
Two words. ``Integer overflow.''

Try the above formulation with counter equal to 10 and delta equal to ULONG_MAX. Then try it again with the code shown in Figure [*].

A good understanding of integer overflow will be required for the rest of this example, so if you have never dealt with integer overflow before, please try several examples to get the hang of it. Integer overflow can sometimes be more difficult to get right than parallel algorithms!
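
For concreteness, here is a worked example (assuming a 64-bit unsigned long):

  unsigned long counter = 10;
  unsigned long countermax = 100;
  unsigned long delta = ULONG_MAX;

  /* counter + delta wraps around to 9, so the "intuitive"
   * fastpath concludes that 9 <= 100 and happily accepts a
   * delta vastly larger than countermax. */
  if (counter + delta <= countermax)  /* 9 <= 100: passes! */
    counter += delta;                 /* counter is now 9 */

In contrast, a condition of the form countermax - counter >= delta cannot wrap, because counter never exceeds countermax.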

Quick Quiz [*].26: 
Why does globalize_count() zero the per-thread variables, only to have balance_count() later refill them in Figure [*]? Why not just leave the per-thread variables non-zero?
 
Answer:
That is in fact what an earlier version of this code did. But addition and subtraction are extremely cheap, and handling all of the special cases that arise is quite complex. Again, feel free to try it yourself, but beware of integer overflow!

Quick Quiz [*].27: 
Given that globalreserve counted against us in add_count(), why doesn't it count for us in sub_count() in Figure [*]?
 
Answer:
The globalreserve variable tracks the sum of all threads' countermax variables. The sum of these threads' counter variables might be anywhere from zero to globalreserve. We must therefore take a conservative approach, assuming that all threads' counter variables are full in add_count() and that they are all empty in sub_count().

But remember this question, as we will come back to it later.

Quick Quiz [*].28: 
Why have both add_count() and sub_count() in Figure [*]? Why not simply pass a negative number to add_count()?
 
Answer:
Given that add_count() takes an unsigned long as its argument, it is going to be a bit tough to pass it a negative number. And unless you have some anti-matter memory, there is little point in allowing negative numbers when counting the number of structures in use!

Quick Quiz [*].29: 
In what way does line 7 of Figure [*] violate the C standard?
 
Answer:
It assumes eight bits per byte. This assumption does hold for all current commodity microprocessors that can be easily assembled into shared-memory multiprocessors, but certainly does not hold for all computer systems that have ever run C code. (What could you do instead in order to comply with the C standard? What drawbacks would it have?)
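
One standard-compliant possibility, sketched below, is to use CHAR_BIT from <limits.h> (assuming the figure's atomic_t-based counterandmax); the main drawback is a more verbose and arguably more obscure expression:

  #include <limits.h>

  /* Half of the bits of counterandmax, however many bits
   * this system's bytes might happen to have. */
  #define CM_BITS (sizeof(atomic_t) * CHAR_BIT / 2)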

Quick Quiz [*].30: 
Given that there is only one counterandmax variable, why bother passing in a pointer to it on line 18 of Figure [*]?
 
Answer:
There is only one counterandmax variable per thread. Later, we will see code that needs to pass other threads' counterandmax variables to split_counterandmax().

Quick Quiz [*].31: 
Why does merge_counterandmax() in Figure [*] return an int rather than storing directly into an atomic_t?
 
Answer:
Later, we will see that we need the int return to pass to the atomic_cmpxchg() primitive.

Quick Quiz [*].32: 
Yecch! Why the ugly goto on line 11 of Figure [*]? Haven't you heard of the break statement???
 
Answer:
Replacing the goto with a break would require keeping a flag to determine whether or not line 15 should return, which is not the sort of thing you want on a fastpath. If you really hate the goto that much, your best bet would be to pull the fastpath into a separate function that returned success or failure, with ``failure'' indicating a need for the slowpath. This is left as an exercise for goto-hating readers.
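
For such readers, one possible shape of that refactoring, as a sketch (counterandmax, MAX_COUNTERMAX, and the helper signatures are inferred from this chapter's figures):

  /* Returns nonzero on success, zero if the caller must
   * take the slowpath. */
  static int add_count_fastpath(unsigned long delta)
  {
    int c;
    int cm;
    int old;
    int new;

    do {
      split_counterandmax(&counterandmax, &old, &c, &cm);
      if (delta > MAX_COUNTERMAX || c + delta > cm)
        return 0;  /* "failure": slowpath needed */
      new = merge_counterandmax(c + delta, cm);
    } while (atomic_cmpxchg(&counterandmax, old, new) != old);
    return 1;    /* fastpath succeeded */
  }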

Quick Quiz [*].33: 
Why would the atomic_cmpxchg() primitive at lines 13-14 of Figure [*] ever fail? After all, we picked up its old value on line 9 and have not changed it!
 
Answer:
Later, we will see how the flush_local_count() function in Figure [*] might update this thread's counterandmax variable concurrently with the execution of the fastpath on lines 8-14 of Figure [*].

Quick Quiz [*].34: 
What stops a thread from simply refilling its counterandmax variable immediately after flush_local_count() on line 14 of Figure [*] empties it?
 
Answer:
This other thread cannot refill its counterandmax until the caller of flush_local_count() releases the gblcnt_mutex. By that time, the caller of flush_local_count() will have finished making use of the counts, so there will be no problem with this other thread refilling -- assuming that the value of globalcount is large enough to permit a refill.

Quick Quiz [*].35: 
What prevents concurrent execution of the fastpath of either add_count() or sub_count() from interfering with the counterandmax variable while flush_local_count() is accessing it on line 27 of Figure [*]?
 
Answer:
Nothing. Consider the following three cases:

  1. If flush_local_count()'s atomic_xchg() executes before the split_counterandmax() of either fastpath, then the fastpath will see a zero counter and countermax, and will thus transfer to the slowpath (unless of course delta is zero).
  2. If flush_local_count()'s atomic_xchg() executes after the split_counterandmax() of either fastpath, but before that fastpath's atomic_cmpxchg(), then the atomic_cmpxchg() will fail, causing the fastpath to restart, which reduces to case 1 above.
  3. If flush_local_count()'s atomic_xchg() executes after the atomic_cmpxchg() of either fastpath, then the fastpath will (most likely) complete successfully before flush_local_count() zeroes the thread's counterandmax variable.
In all three cases, the race is resolved correctly.

Quick Quiz [*].36: 
Given that the atomic_set() primitive does a simple store to the specified atomic_t, how can line 53 of balance_count() in Figure [*] work correctly in face of concurrent flush_local_count() updates to this variable?
 
Answer:
The callers of both balance_count() and flush_local_count() hold gblcnt_mutex, so only one may be executing at a given time.

Quick Quiz [*].37: 
But signal handlers can be migrated to some other CPU while running. Doesn't this possibility mean that atomic instructions and memory barriers are required to reliably communicate between a thread and a signal handler that interrupts that thread?
 
Answer:
No. If the signal handler is migrated to another CPU, then the interrupted thread is also migrated along with it.

Quick Quiz [*].38: 
In Figure [*], why is the REQ theft state colored blue?
 
Answer:
To indicate that only the fastpath is permitted to change the theft state.

Quick Quiz [*].39: 
In Figure [*], what is the point of having separate REQ and ACK theft states? Why not simplify the state machine by collapsing them into a single state? Then whichever of the signal handler or the fastpath gets there first could set the state to READY.
 
Answer:
Reasons why collapsing the REQ and ACK states would be a very bad idea include:

  1. The slowpath uses the REQ and ACK states to determine whether the signal should be retransmitted. If the states were collapsed, the slowpath would have no choice but to send redundant signals, which would have the unhelpful effect of slowing down the fastpath.
  2. The following race would result:
    1. The slowpath sets a given thread's state to REQACK.
    2. That thread has just finished its fastpath, and notes the REQACK state.
    3. The thread receives the signal, which also notes the REQACK state, and, because there is no fastpath in effect, sets the state to READY.
    4. The slowpath notes the READY state, steals the count, and sets the state to IDLE, and completes.
    5. The fastpath sets the state to READY, disabling further fastpath execution for this thread.
    The basic problem here is that the combined REQACK state can be referenced by both the signal handler and the fastpath. The clear separation maintained by the four-state setup ensures orderly state transitions.
That said, you might well be able to make a three-state setup work correctly. If you do succeed, compare carefully to the four-state setup. Is the three-state solution really preferable, and why or why not?

Quick Quiz [*].40: 
In Figure [*] function flush_local_count_sig(), why are there ACCESS_ONCE() wrappers around the uses of the theft per-thread variable?
 
Answer:
The first one (on line 11) can be argued to be unnecessary. The last two (lines 14 and 16) are important. If these are removed, the compiler would be within its rights to rewrite lines 14-17 as follows:

 14   theft = THEFT_READY;
 15   if (counting) {
 16     theft = THEFT_ACK;
 17   }


This would be fatal, as the slowpath might see the transient value of THEFT_READY, and start stealing before the corresponding thread was ready.
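
For comparison, a sketch of lines 14-17 as presumably written, with the ACCESS_ONCE() wrappers preventing this rewrite:

 14   ACCESS_ONCE(theft) = THEFT_ACK;
 15   if (!counting) {
 16     ACCESS_ONCE(theft) = THEFT_READY;
 17   }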

Quick Quiz [*].41: 
In Figure [*], why is it safe for line 28 to directly access the other thread's countermax variable?
 
Answer:
Because the other thread is not permitted to change the value of its countermax variable unless it holds the gblcnt_mutex lock. But the caller has acquired this lock, so it is not possible for the other thread to hold it, and therefore the other thread is not permitted to change its countermax variable. We can therefore safely access it -- but not change it.

Quick Quiz [*].42: 
In Figure [*], why doesn't line 33 check for the current thread sending itself a signal?
 
Answer:
There is no need for an additional check. The caller of flush_local_count() has already invoked globalize_count(), so the check on line 28 will have succeeded, skipping the later pthread_kill().

Quick Quiz [*].43: 
The code in Figure [*] works with gcc and POSIX. What would be required to make it also conform to the ISO C standard?
 
Answer:
The theft variable must be of type sig_atomic_t to guarantee that it can be safely shared between the signal handler and the code interrupted by the signal.
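
A sketch of such a declaration (THEFT_IDLE follows the chapter's theft-state naming; note that, strictly speaking, ISO C blesses asynchronous access only to objects declared volatile sig_atomic_t):

  #include <signal.h>

  static volatile sig_atomic_t theft = THEFT_IDLE;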

Quick Quiz [*].44: 
In Figure [*], why does line 41 resend the signal?
 
Answer:
Because many operating systems over several decades have had the property of losing the occasional signal. Whether this is a feature or a bug is debatable, but irrelevant. The obvious symptom from the user's viewpoint will not be a kernel bug, but rather a user application hanging.

Your user application hanging!

Quick Quiz [*].45: 
What if you want an exact limit counter to be exact only for its lower limit?
 
Answer:
One simple solution is to overstate the upper limit by the desired amount. The limiting case of such overstatement results in the upper limit being set to the largest value that the counter is capable of representing.

Quick Quiz [*].46: 
What else had you better have done when using a biased counter?
 
Answer:
You had better have set the upper limit to be large enough to accommodate the bias, the expected maximum number of accesses, and enough ``slop'' to allow the counter to work efficiently even when the number of accesses is at its maximum.

Quick Quiz [*].47: 
This is ridiculous! We are read-acquiring a reader-writer lock to update the counter? What are you playing at???
 
Answer:
Strange, perhaps, but true! Almost enough to make you think that the name ``reader-writer lock'' was poorly chosen, isn't it?

Quick Quiz [*].48: 
What other issues would need to be accounted for in a real system?
 
Answer:
A huge number!

Here are a few to start with:

  1. There could be any number of devices, so that the global variables are inappropriate, as is the lack of arguments to functions like do_io().
  2. Polling loops can be problematic in real systems. In many cases, it is far better to have the last completing I/O wake up the device-removal thread.
  3. The I/O might fail, and so do_io() will likely need a return value.
  4. If the device fails, the last I/O might never complete. In such cases, there might need to be some sort of timeout to allow error recovery.
  5. Both add_count() and sub_count() can fail, but their return values are not checked.
  6. Reader-writer locks do not scale well. One way of avoiding the high read-acquisition costs of reader-writer locks is presented in Chapter [*].

Quick Quiz [*].49: 
On the count_stat.c row of Table [*], we see that the read side scales linearly with the number of threads. How is that possible given that the more threads there are, the more per-thread counters must be summed up?
 
Answer:
The read-side code must scan the entire fixed-size array, regardless of the number of threads, so there is no difference in performance. In contrast, in the last two algorithms, readers must do more work when there are more threads. In addition, the last two algorithms interpose an additional level of indirection because they map from integer thread ID to the corresponding __thread variable.

Quick Quiz [*].50: 
Even on the last row of Table [*], the read-side performance of these statistical counter implementations is pretty horrible. So why bother with them?
 
Answer:
``Use the right tool for the job.''

As can be seen from Figure [*], single-variable atomic increment need not apply for any job involving heavy use of parallel updates. In contrast, the algorithms shown in Table [*] do an excellent job of handling update-heavy situations. Of course, if you have a read-mostly situation, you should use something else, for example, a single atomically incremented variable that can be read out using a single load.

Quick Quiz [*].51: 
Given the performance data shown in Table [*], we should always prefer update-side signals over read-side atomic operations, right?
 
Answer:
That depends on the workload. Note that you need a million readers (with roughly a 40-nanosecond performance gain) to make up for even one writer (with almost a 40-millisecond performance loss). Although there is no shortage of workloads with far greater read intensity, you will need to consider your particular workload.

In addition, although memory barriers have historically been expensive compared to ordinary instructions, you should check this on the specific hardware you will be running. The properties of computer hardware do change over time, and algorithms must change accordingly.

Quick Quiz [*].52: 
Can advanced techniques be applied to address the lock contention for readers seen in Table [*]?
 
Answer:
There are a number of ways one might go about this, and these are left as exercises for the reader.

Quick Quiz [*].53: 
The ++ operator works just fine for 1,000-digit numbers! Haven't you heard of operator overloading???
 
Answer:
In the C++ language, you might well be able to use ++ on a 1,000-digit number, assuming that you had access to a class implementing such numbers. But as of 2010, the C language does not permit operator overloading.

Quick Quiz [*].54: 
But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space?
 
Answer:
Indeed, multiple processes with separate address spaces can be an excellent way to exploit parallelism, as the proponents of the fork-join methodology and the Erlang language would be very quick to tell you. However, there are also some advantages to shared-memory parallelism:

  1. Only the most performance-critical portions of the application must be partitioned, and such portions are usually a small fraction of the application.
  2. Although cache misses are quite slow compared to individual register-to-register instructions, they are typically considerably faster than inter-process-communication primitives, which in turn are considerably faster than things like TCP/IP networking.
  3. Shared-memory multiprocessors are readily available and quite inexpensive, so, in stark contrast to the 1990s, there is little cost penalty for use of shared-memory parallelism.
As always, use the right tool for the job!

Paul E. McKenney 2011-12-16