Quick Quiz .2:
Network-packet counting problem.
Suppose that you need to collect statistics on the number
of networking packets (or total number of bytes) transmitted
and/or received.
Packets might be transmitted or received by any CPU on
the system.
Suppose further that this large machine is capable of
handling a million packets per second, and that there
is a systems-monitoring package that reads out the count
every five seconds.
How would you implement this statistical counter?
Answer:
Hint: the act of updating the counter must be blazingly
fast, but because the counter is read out only about once
in five million updates, the act of reading out the counter can be
quite slow.
In addition, the value read out normally need not be all that
accurate--after all, since the counter is updated a thousand
times per millisecond, we should be able to work with a value
that is within a few thousand counts of the ``true value'',
whatever ``true value'' might mean in this context.
However, the value read out should maintain roughly the same
absolute error over time.
For example, a 1% error might be just fine when the count
is on the order of a million or so, but might be absolutely
unacceptable once the count reaches a trillion.
See Section .
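For concreteness, here is a minimal sketch of the per-thread approach that this hint points toward. The names (inc_count(), read_count(), counterp[]) are illustrative rather than taken from the book's code, and gcc's __thread storage class plus a fixed-size thread-slot array are assumed:

#include <stddef.h>

#define NR_THREADS 128

unsigned long __thread counter = 0;             /* one counter per thread */
unsigned long *counterp[NR_THREADS] = { NULL }; /* registered at thread start */

void inc_count(void)            /* blazingly fast: no atomics, no sharing */
{
	counter++;
}

unsigned long read_count(void)  /* slow and approximate: sums all threads */
{
	unsigned long sum = 0;
	int t;

	for (t = 0; t < NR_THREADS; t++)
		if (counterp[t] != NULL)
			sum += *counterp[t];
	return sum;
}

Updates touch only a thread-local cache line, while the reader pays the full cost of scanning every slot, which is exactly the fast-update/slow-read tradeoff that the hint calls for.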
Quick Quiz .3:
Approximate structure-allocation limit problem.
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that a ``sloppy''
approximate limit is acceptable.
Answer:
Hint: the act of updating the counter must be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
except that it absolutely must distinguish perfectly
between values below the limit and values greater than or
equal to the limit.
See Section .
Quick Quiz .4:
Exact structure-allocation limit problem.
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds an exact limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that there is
almost always at least one structure in use.
Suppose further still that it is necessary to know exactly
when this counter reaches zero, for example, in order to
free up some memory that is not required unless there is
at least one structure in use.
Answer:
Hint: the act of updating the counter must be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
except that it absolutely must distinguish perfectly
between values between the limit and zero on the one hand,
and values that either are less than or equal to zero or
are greater than or equal to the limit on the other hand.
See Section .
Quick Quiz .5:
Removable I/O device access-count problem.
Suppose that you need to maintain a reference count on a
heavily used removable mass-storage device, so that you
can tell the user when it is safe to remove the device.
This device follows the usual removal procedure where
the user indicates a desire to remove the device, and
the system tells the user when it is safe to do so.
Answer:
Hint: the act of updating the counter must be blazingly
fast and scalable in order to avoid slowing down I/O operations,
but because the counter is read out only when the
user wishes to remove the device, the counter read-out
operation can be extremely slow.
Furthermore, there is no need to be able to read out
the counter at all unless the user has already indicated
a desire to remove the device.
In addition, the value read out need not be accurate
except that it absolutely must distinguish perfectly
between non-zero and zero values.
However, once it has read out a zero value, it must act
to keep the value at zero until it has taken some action
to prevent subsequent threads from gaining access to the
device being removed.
See Section .
Quick Quiz .6:
But doesn't the ++ operator produce an x86 add-to-memory
instruction?
And won't the CPU cache cause this to be atomic?
Answer:
Although the ++ operator could be atomic, there
is no requirement that it be so.
Furthermore, the ACCESS_ONCE() primitive forces most
versions of gcc to load the value into a register, increment
the register, then store the value back to memory, which is
decidedly non-atomic.
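To make the distinction concrete, the following sketch contrasts the two forms, using gcc's __sync_fetch_and_add() built-in for the atomic case (the variable and function names are illustrative):

unsigned long counter;

void inc_nonatomic(void)
{
	/* Typically compiles to a separate load, increment, and store:
	 * two threads can each load the same old value, and one of
	 * the two increments is then lost. */
	counter++;
}

void inc_atomic(void)
{
	/* gcc built-in; on x86 this emits a lock-prefixed instruction,
	 * so no counts are lost, at the price of a slower update. */
	__sync_fetch_and_add(&counter, 1);
}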
Quick Quiz .7:
The 8-figure accuracy on the number of failures indicates
that you really did test this.
Why would it be necessary to test such a trivial program,
especially when the bug is easily seen by inspection?
Answer:
There are no trivial parallel programs, and most days I am
not so sure that there are trivial sequential programs, either.
No matter how small or simple the program, if you haven't tested it, it does not work. And even if you have tested it, Murphy says there are at least a few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their place, they never will replace testing, including the counttorture.h test setup used here. After all, proofs can have bugs just as easily as can programs!
Quick Quiz .8:
Why doesn't the dashed line on the x axis meet the
diagonal line at ?
Answer:
Because of the overhead of the atomic operation.
The dashed line on the x axis represents the overhead of
a single non-atomic increment.
After all, an ideal algorithm would not only scale
linearly, it would also incur no performance penalty compared
to single-threaded code.
This level of idealism may seem severe, but if it is good enough for Linus Torvalds, it is good enough for you.
Quick Quiz .9:
But atomic increment is still pretty fast.
And incrementing a single variable in a tight loop sounds
pretty unrealistic to me, after all, most of the program's
execution should be devoted to actually doing work, not accounting
for the work it has done!
Why should I care about making this go faster?
Answer:
In many cases, atomic increment will in fact be fast enough
for you.
In those cases, you should by all means use atomic increment.
That said, there are many real-world situations where
more elaborate counting algorithms are required.
The canonical example of such a situation is counting packets
and bytes in highly optimized networking stacks, where it is
all too easy to find much of the execution time going into
these sorts of accounting tasks, especially on large
multiprocessors.
In addition, counting provides an excellent view of the issues encountered in shared-memory parallel programs.
Quick Quiz .10:
But why can't CPU designers simply ship the operation to the
data, avoiding the need to circulate the cache line containing
the global variable being incremented?
Answer:
It might well be possible to do this in some cases.
However, there are a few complications:
If either or both of the first two conditions hold, there is
some hope for improvement.
One could imagine the hardware implementing a combining tree,
so that the increment requests from multiple CPUs are combined
by the hardware into a single addition when the combined request
reaches the hardware.
The hardware could also apply an order to the requests, thus
returning to each CPU the return value corresponding to its
particular atomic increment.
This results in instruction latency that varies as O(log N),
where N is the number of CPUs, as shown in
Figure
.
This is a great improvement over the performance
of current hardware shown in
Figure
,
and it is possible that hardware latencies might decrease
somewhat if innovations such as 3D fabrication prove
practical.
Nevertheless, we will see that in some important special cases,
software can do much better.
Quick Quiz .11:
But doesn't the fact that C's ``integers'' are limited in size
complicate things?
Answer:
No, because modulo addition is still commutative and associative,
at least as long as you use unsigned integers.
Recall that in the C standard, overflow of signed integers results
in undefined behavior (never mind the fact that machines that
do anything other than wrap on overflow are quite rare these days).
That said, one potential source of additional complexity arises when attempting to gather (say) a 64-bit sum from 32-bit per-thread counters. For the moment, dealing with this added complexity is left as an exercise for the reader.
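As a quick illustration of why unsigned wrap is harmless here, consider the following toy program (the values are arbitrary):

#include <stdio.h>

int main(void)
{
	unsigned int a = 0xfffffff0u;  /* per-thread counter about to wrap */
	unsigned int b = 0x20u;        /* another thread's counter */
	unsigned int sum = a + b;      /* wraps modulo 2^32 to 0x10 */

	/* Differences remain exact despite the wrap: */
	printf("%#x\n", sum - a);      /* prints 0x20 */
	return 0;
}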
Quick Quiz .12:
An array???
But doesn't that limit the number of threads?
Answer:
It can, and in this toy implementation, it does.
But it is not that hard to come up with an alternative
implementation that permits an arbitrary number of threads.
However, this is left as an exercise for the reader.
Quick Quiz .13:
What other choice does gcc have, anyway???
Answer:
According to the C standard, the effects of fetching a variable
that might be concurrently modified by some other thread are
undefined.
It turns out that the C standard really has no other choice,
given that C must support (for example) eight-bit architectures
which are incapable of atomically loading a long.
An upcoming version of the C standard aims to fill this gap,
but until then, we depend on the kindness of the gcc developers.
Quick Quiz .14:
How does the per-thread counter variable in
Figure
get initialized?
Answer:
The C standard specifies that the initial value of
global variables is zero, unless they are explicitly initialized.
So the initial value of all the instances of counter
will be zero.
That said, one often takes differences of consecutive reads from statistical counters, in which case the initial value is irrelevant.
Quick Quiz .15:
How is the code in
Figure
supposed to permit more than one counter?
Answer:
Indeed, this toy example does not support more than one counter.
Modifying it so that it can provide multiple counters is left
as an exercise to the reader.
Quick Quiz .16:
Why does inc_count() in
Figure
need to use atomic instructions?
Answer:
If non-atomic instructions were used, counts could be lost.
Quick Quiz .17:
Won't the single global thread in the function eventual() of
Figure
be just as severe a bottleneck as a global lock would be?
Answer:
In this case, no.
What will happen instead is that the estimate of the counter
value returned by read_count() will become more inaccurate.
Quick Quiz .18:
Won't the estimate returned by read_count() in
Figure
become increasingly
inaccurate as the number of threads rises?
Answer:
Yes.
If this proves problematic, one fix is to provide multiple
eventual() threads, each covering its own subset of
the other threads.
In even more extreme cases, a tree-like hierarchy of
eventual() threads might be required.
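For reference, a minimal sketch of the single-eventual()-thread scheme under discussion appears below; the names are illustrative, and gcc's __thread storage class plus a fixed-size slot array are assumed:

#include <stddef.h>
#include <unistd.h>

#define NR_THREADS 128

unsigned long __thread counter;
unsigned long *counterp[NR_THREADS];
unsigned long global_count;          /* the eventually consistent estimate */

void *eventual(void *arg)            /* one background thread */
{
	unsigned long sum;
	int t;

	for (;;) {
		sum = 0;
		for (t = 0; t < NR_THREADS; t++)
			if (counterp[t] != NULL)
				sum += *counterp[t];
		global_count = sum;  /* readers pick up a recent estimate */
		usleep(1000);        /* bounds both staleness and overhead */
	}
	return NULL;
}

unsigned long read_count(void)       /* fast, but only approximate */
{
	return global_count;
}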
Quick Quiz .19:
Why do we need an explicit array to find the other threads'
counters?
Why doesn't gcc provide a per_thread() interface, similar
to the Linux kernel's per_cpu() primitive, to allow
threads to more easily access each others' per-thread variables?
Answer:
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel gets to ignore. When a user-level thread exits, its per-thread variables all disappear, which complicates the problem of per-thread-variable access, particularly before the advent of user-level RCU. In contrast, in the Linux kernel, when a CPU goes offline, that CPU's per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its per-thread variables suddenly come into existence. In contrast, in the Linux kernel, all per-CPU variables are mapped and initialized at boot time, regardless of whether the corresponding CPU exists yet, or indeed, whether the corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time maximum limit on the number of CPUs, namely, CONFIG_NR_CPUS. In contrast, in user space, there is no hard-coded upper limit on the number of threads.
Of course, both environments must deal with dynamically loaded code (dynamic libraries in user space, kernel modules in the Linux kernel), which increases the complexity of per-thread variables in both environments.
These complications make it significantly harder for user-space environments to provide access to other threads' per-thread variables. Nevertheless, such access is highly useful, and it is hoped that it will someday appear.
Quick Quiz .20:
Why on earth do we need something as heavyweight as a lock
guarding the summation in the function read_count() in
Figure ?
Answer:
Remember, when a thread exits, its per-thread variables disappear.
Therefore, if we attempt to access a given thread's per-thread
variables after that thread exits, we will get a segmentation
fault.
The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock,
but Chapter will introduce even
lighter-weight mechanisms for implementing the required coordination.
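A sketch of the coordination in question appears below (illustrative names): the exiting thread folds its count into a global total and unregisters itself under the very same lock that read_count() holds while summing, so the reader can never dereference a defunct thread's counter:

#include <stddef.h>
#include <pthread.h>

#define NR_THREADS 128

unsigned long __thread counter;
unsigned long *counterp[NR_THREADS];
unsigned long finalcount;            /* counts from already-exited threads */
pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER;

void count_unregister_thread(int idx)    /* called at thread exit */
{
	pthread_mutex_lock(&final_mutex);
	finalcount += counter;           /* preserve this thread's tally */
	counterp[idx] = NULL;            /* summation will skip this slot */
	pthread_mutex_unlock(&final_mutex);
}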
Quick Quiz .21:
Why on earth do we need to acquire the lock in
count_register_thread() in
Figure ?
It is a single properly aligned machine-word store to a location
that no other thread is modifying, so it should be atomic anyway,
right?
Answer:
This lock could in fact be omitted, but better safe than
sorry, especially given that this function is executed only at
thread startup, and is therefore not on any critical path.
Now, if we were testing on machines with thousands of CPUs,
we might need to omit the lock, but on machines with ``only''
a hundred or so CPUs, no need to get fancy.
Quick Quiz .22:
Fine, but the Linux kernel doesn't have to acquire a lock
when reading out the aggregate value of per-CPU counters.
So why should user-space code need to do this???
Answer:
Remember, the Linux kernel's per-CPU variables are always
accessible, even if the corresponding CPU is offline -- even
if the corresponding CPU never existed and never will exist.
One workaround is to ensure that each thread sticks around
until all threads are finished, as shown in
Figure .
Analysis of this code is left as an exercise to the reader,
however, please note that it does not fit well into the
counttorture.h counter-evaluation scheme.
(Why not?)
Chapter
will introduce
synchronization mechanisms that handle this situation in a much
more graceful manner.
Quick Quiz .23:
What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given
that the packets vary in size?
Answer:
When counting packets, the counter is only incremented by the
value one.
On the other hand, when counting bytes, the counter might
be incremented by largish numbers.
Why does this matter? Because in the increment-by-one case, the value returned will be exact in the sense that the counter must necessarily have taken on that value at some point in time, even if it is impossible to say precisely when that point occurred. In contrast, when counting bytes, two different threads might return values that are inconsistent with any global ordering of operations.
To see this, suppose that thread 0 adds the value three to its counter, thread 1 adds the value five to its counter, and threads 2 and 3 sum the counters. If the system is ``weakly ordered'' or if the compiler uses aggressive optimizations, thread 2 might find the sum to be three and thread 3 might find the sum to be five. The only possible global orders of the sequence of values of the counter are 0,3,8 and 0,5,8, and neither order is consistent with the results obtained.
If you missed this one, you are not alone. Michael Scott used this question to stump Paul McKenney during Paul's Ph.D. defense.
Quick Quiz .24:
Given that the reader must sum all the threads' counters,
this could take a long time given large numbers of threads.
Is there any way that the increment operation can remain
fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
Answer:
One approach would be to maintain a global approximation
to the value.
Updaters would increment their per-thread variable, but when it
reached some predefined limit, atomically add it to a global
variable, then zero their per-thread variable.
This would permit a tradeoff between average increment overhead
and accuracy of the value read out.
The reader is encouraged to think up and try out other approaches, for example, using a combining tree.
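A minimal sketch of this global-approximation scheme appears below; the names and batch size are illustrative:

#define COUNT_BATCH 1024

unsigned long __thread counter;      /* per-thread partial count */
unsigned long global_count;          /* approximate running total */

void inc_count(void)
{
	if (++counter >= COUNT_BATCH) {
		__sync_fetch_and_add(&global_count, counter);
		counter = 0;
	}
}

unsigned long read_count(void)
{
	/* Single load: off by at most COUNT_BATCH - 1 per thread. */
	return global_count;
}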
Quick Quiz .25:
What is with the strange form of the condition on line 3 of
Figure ?
Why not the following more intuitive form of the fastpath?
3 if (counter + delta <= countermax) {
4   counter += delta;
5   return 1;
6 }
Answer:
Try the above formulation with counter equal to 10 and
delta equal to ULONG_MAX.
Then try it again with the code shown in
Figure .
A good understanding of integer overflow will be required for the rest of this example, so if you have never dealt with integer overflow before, please try several examples to get the hang of it. Integer overflow can sometimes be more difficult to get right than parallel algorithms!
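To see the failure concretely, consider the following toy program, which also shows a subtractive comparison along the lines of the one in the figure, which cannot wrap:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned long counter = 10;
	unsigned long countermax = 100;
	unsigned long delta = ULONG_MAX;

	/* 10 + ULONG_MAX wraps around to 9, and 9 <= 100... */
	if (counter + delta <= countermax)
		printf("fastpath taken despite delta == ULONG_MAX!\n");

	/* The subtractive form cannot wrap this way: 90 >= ULONG_MAX
	 * is false, so the huge delta is correctly rejected. */
	if (countermax - counter >= delta)
		printf("this line is never reached\n");
	return 0;
}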
Quick Quiz .26:
Why does globalize_count() zero the per-thread variables,
only to later call balance_count() to refill them in
Figure ?
Why not just leave the per-thread variables non-zero?
Answer:
That is in fact what an earlier version of this code did.
But addition and subtraction are extremely cheap, and handling
all of the special cases that arise is quite complex.
Again, feel free to try it yourself, but beware of integer
overflow!
Quick Quiz .27:
Given that globalreserve counted against us in add_count(),
why doesn't it count for us in sub_count() in
Figure ?
Answer:
The globalreserve variable tracks the sum of all threads'
countermax variables.
The sum of these threads' counter variables might be anywhere
from zero to globalreserve.
We must therefore take a conservative approach, assuming that
all threads' counter variables are full in add_count()
and that they are all empty in sub_count().
But remember this question, as we will come back to it later.
Quick Quiz .28:
Why have both add_count() and sub_count() in
Figure ?
Why not simply pass a negative number to add_count()?
Answer:
Given that add_count() takes an unsigned long
as its argument, it is going to be a bit tough to pass it a
negative number.
And unless you have some anti-matter memory, there is little
point in allowing negative numbers when counting the number
of structures in use!
Quick Quiz .29:
In what way does line 7 of
Figure
violate the C standard?
Answer:
It assumes eight bits per byte.
This assumption does hold for all current commodity microprocessors
that can be easily assembled into shared-memory multiprocessors,
but certainly does not hold for all computer systems that have
ever run C code.
(What could you do instead in order to comply with the C
standard? What drawbacks would it have?)
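For example, one standard-conforming alternative is to compute the width from CHAR_BIT rather than assuming eight-bit bytes, as in the following sketch (the macro names are illustrative); the drawback is a somewhat more opaque expression, for no gain on any machine you are likely to encounter:

#include <limits.h>

/* Bits in an unsigned long, without assuming 8-bit bytes: */
#define ULONG_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Half the bits for the counter, half for the maximum: */
#define CM_BITS (ULONG_BITS / 2)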
Quick Quiz .30:
Given that there is only one counterandmax variable,
why bother passing in a pointer to it on line 18 of
Figure ?
Answer:
There is only one counterandmax variable per thread.
Later, we will see code that needs to pass other threads'
counterandmax variables to split_counterandmax().
Quick Quiz .31:
Why does merge_counterandmax() in
Figure
return an int rather than storing directly into an
atomic_t?
Answer:
Later, we will see that we need the int return to pass
to the atomic_cmpxchg() primitive.
Quick Quiz .32:
Yecch!
Why the ugly goto on line 11 of
Figure ?
Haven't you heard of the break statement???
Answer:
Replacing the goto with a break would require keeping
a flag to determine whether or not line 15 should return, which
is not the sort of thing you want on a fastpath.
If you really hate the goto that much, your best bet would
be to pull the fastpath into a separate function that returned
success or failure, with ``failure'' indicating a need for the
slowpath.
This is left as an exercise for goto-hating readers.
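Such a refactoring might look something like the following sketch, with illustrative names and the slowpath assumed to be defined elsewhere:

unsigned long __thread counter;
unsigned long __thread countermax;

extern int add_count_slow(unsigned long delta);  /* assumed elsewhere */

static int add_count_fast(unsigned long delta)   /* nonzero on success */
{
	if (countermax - counter >= delta) {
		counter += delta;
		return 1;
	}
	return 0;                        /* caller must take the slowpath */
}

int add_count(unsigned long delta)
{
	if (add_count_fast(delta))
		return 1;
	return add_count_slow(delta);
}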
Quick Quiz .33:
Why would the atomic_cmpxchg() primitive at lines 13-14 of
Figure
ever fail?
After all, we picked up its old value on line 9 and have not
changed it!
Answer:
Later, we will see how the flush_local_count() function in
Figure
might update this thread's counterandmax variable concurrently
with the execution of the fastpath on lines 8-14 of
Figure
.
Quick Quiz .34:
What stops a thread from simply refilling its
counterandmax variable immediately after
flush_local_count() on line 14 of
Figure
empties it?
Answer:
This other thread cannot refill its counterandmax
until the caller of flush_local_count() releases the
gblcnt_mutex.
By that time, the caller of flush_local_count() will have
finished making use of the counts, so there will be no problem
with this other thread refilling -- assuming that the value
of globalcount is large enough to permit a refill.
Quick Quiz .35:
What prevents concurrent execution of the fastpath of either
atomic_add() or atomic_sub() from interfering with
the counterandmax variable while
flush_local_count() is accessing it on line 27 of
Figure ?
Answer:
Nothing.
Consider the following three cases:
Quick Quiz .36:
Given that the atomic_set() primitive does a simple
store to the specified atomic_t, how can line 53 of
balance_count() in
Figure
work correctly in the face of concurrent flush_local_count()
updates to this variable?
Answer:
The callers of both balance_count() and
flush_local_count() hold gblcnt_mutex, so
only one of them may be executing at a given time.
Quick Quiz .37:
But signal handlers can be migrated to some other
CPU while running.
Doesn't this possibility mean that atomic instructions
and memory barriers are required to reliably communicate
between a thread and a signal handler that interrupts that
thread?
Answer:
No.
If the signal handler is migrated to another CPU, then the
interrupted thread is also migrated along with it.
Quick Quiz .38:
In Figure , why is
the REQ theft state colored blue?
Answer:
To indicate that only the fastpath is permitted to change the
theft state.
Quick Quiz .39:
In Figure , what is
the point of having separate REQ and ACK theft states?
Why not simplify the state machine by collapsing
them into a single state?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
Answer:
Reasons why collapsing the REQ and ACK states would be a very
bad idea include:
Quick Quiz .40:
In Figure
function flush_local_count_sig(), why are there
ACCESS_ONCE() wrappers around the uses of the
theft per-thread variable?
Answer:
The first one (on line 11) can be argued to be unnecessary.
The last two (lines 14 and 16) are important.
If these are removed, the compiler would be within its rights
to rewrite lines 14-17 as follows:
14 theft = THEFT_READY;
15 if (counting) {
16   theft = THEFT_ACK;
17 }

This rewrite would be fatal: the slowpath could observe the transient THEFT_READY value and begin stealing the count while the fastpath was still manipulating it.
Quick Quiz .41:
In Figure ,
why is it safe for line 28 to directly access the other thread's
countermax variable?
Answer:
Because the other thread is not permitted to change the value
of its countermax variable unless it holds the
gblcnt_mutex lock.
But the caller has acquired this lock, so it is not possible
for the other thread to hold it, and therefore the other thread
is not permitted to change its countermax variable.
We can therefore safely access it -- but not change it.
Quick Quiz .42:
In Figure ,
why doesn't line 33 check for the current thread sending itself
a signal?
Answer:
There is no need for an additional check.
The caller of flush_local_count() has already invoked
globalize_count(), so the check on line 28 will have
succeeded, skipping the later pthread_kill().
Quick Quiz .43:
The code in
Figure
works with gcc and POSIX.
What would be required to make it also conform to the ISO C standard?
Answer:
The theft variable must be of type sig_atomic_t
to guarantee that it can be safely shared between the signal
handler and the code interrupted by the signal.
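In other words, the declaration would need to look something like the following sketch, in which the THEFT_* constants stand in for the figure's theft-state machine:

#include <signal.h>

enum theft_state { THEFT_IDLE, THEFT_REQ, THEFT_ACK, THEFT_READY };

/* ISO C guarantees safe sharing between a signal handler and the
 * interrupted code only for a volatile sig_atomic_t: */
static __thread volatile sig_atomic_t theft = THEFT_IDLE;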
Quick Quiz .44:
In Figure , why does line 41 resend the signal?
Answer:
Because many operating systems over several decades have
had the property of losing the occasional signal.
Whether this is a feature or a bug is debatable, but
irrelevant.
The obvious symptom from the user's viewpoint will not be
a kernel bug, but rather a user application hanging.
Your user application hanging!
Quick Quiz .45:
What if you want an exact limit counter to be exact only for
its lower limit?
Answer:
One simple solution is to overstate the upper limit by the
desired amount.
The limiting case of such overstatement results in the
upper limit being set to the largest value that the counter is
capable of representing.
Quick Quiz .46:
What else had you better have done when using a biased counter?
Answer:
You had better have set the upper limit to be large enough
to accommodate the bias, the expected maximum number of accesses,
and enough ``slop'' to allow the counter to work efficiently
even when the number of accesses is at its maximum.
Quick Quiz .47:
This is ridiculous!
We are read-acquiring a reader-writer lock to
update the counter?
What are you playing at???
Answer:
Strange, perhaps, but true!
Almost enough to make you think that the name
``reader-writer lock'' was poorly chosen, isn't it?
Quick Quiz .48:
What other issues would need to be accounted for in a real system?
Answer:
A huge number!
Here are a few to start with:
Quick Quiz .49:
On the count_stat.c row of
Table ,
we see that the update side scales linearly with the number of
threads.
How is that possible given that the more threads there are,
the more per-thread counters must be summed up?
Answer:
The read-side code must scan the entire fixed-size array, regardless
of the number of threads, so there is no difference in performance.
In contrast, in the last two algorithms, readers must do more
work when there are more threads.
In addition, the last two algorithms interpose an additional
level of indirection because they map from integer thread ID
to the corresponding __thread variable.
Quick Quiz .50:
Even on the last row of
Table ,
the read-side performance of these statistical counter
implementations is pretty horrible.
So why bother with them?
Answer:
``Use the right tool for the job.''
As can be seen from
Figure ,
single-variable atomic increment need not apply for any job
involving heavy use of parallel updates.
In contrast, the algorithms shown in
Table
do an excellent job of handling update-heavy situations.
Of course, if you have a read-mostly situation, you should
use something else, for example, a single atomically incremented
variable that can be read out using a single load.
Quick Quiz .51:
Given the performance data shown in
Table ,
we should always prefer update-side signals over read-side
atomic operations, right?
Answer:
That depends on the workload.
Note that you need a million readers (with roughly
a 40-nanosecond performance gain) to make up for even one
writer (with almost a 40-millisecond performance loss).
Although there is no shortage of workloads with far greater
read intensity, you will need to consider your particular
workload.
In addition, although memory barriers have historically been expensive compared to ordinary instructions, you should check this on the specific hardware you will be running. The properties of computer hardware do change over time, and algorithms must change accordingly.
Quick Quiz .52:
Can advanced techniques be applied to address the lock
contention for readers seen in
Table ?
Answer:
There are a number of ways one might go about this, and these
are left as exercises for the reader.
Quick Quiz .53:
The ++ operator works just fine for 1,000-digit numbers!
Haven't you heard of operator overloading???
Answer:
In the C++ language, you might well be able to use ++
on a 1,000-digit number, assuming that you had access to a
class implementing such numbers.
But as of 2010, the C language does not permit operator overloading.
Quick Quiz .54:
But if we are going to have to partition everything, why bother
with shared-memory multithreading?
Why not just partition the problem completely and run as
multiple processes, each in its own address space?
Answer:
Indeed, multiple processes with separate address spaces can be
an excellent way to exploit parallelism, as the proponents of
the fork-join methodology and the Erlang language would be very
quick to tell you.
However, there are also some advantages to shared-memory parallelism: