In short, use of atomic operations such as compare-and-swap absolutely requires either type-safety or existence guarantees.
Quick Quiz .2:
Why isn't it necessary to guard against cases where one CPU
acquires a reference just after another CPU releases the last
reference?
Answer:
Because a CPU must already hold a reference in order
to legally acquire another reference.
Therefore, if one CPU releases the last reference,
there cannot possibly be any CPU that is permitted
to acquire a new reference.
This same fact allows the non-atomic check in line 22
of Figure .
Quick Quiz .3:
If the check on line 22 of
Figure fails, how
could the check on line 23 possibly succeed?
Answer:
Suppose that kref_put() is protected by RCU, so
that two CPUs might be executing line 22 concurrently.
Both might see the value ``2'', causing both to then
execute line 23.
One of the two instances of atomic_dec_and_test()
will decrement the value to zero and thus return 1.
Quick Quiz .4:
How can it possibly be safe to non-atomically check for equality
with ``1'' on line 22 of
Figure ?
Answer:
Remember that it is not legal to call either kref_get()
or kref_put() unless you hold a reference.
If the reference count is equal to ``1'', then there
can't possibly be another CPU authorized to change the
value of the reference count.
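This reasoning can be sketched in userspace C using C11 atomics. The names below (sketch_kref_put() and friends) are hypothetical stand-ins, not the Linux kref API, but the fastpath logic is the same: a non-atomic check for the value one, falling back to an atomic decrement-and-test.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_kref {
	atomic_int refcount;
};

/* Returns true if this call released the last reference.
 * The non-atomic fastpath is safe only because callers must
 * already hold a reference: if we observe the value 1, no other
 * thread is permitted to change the count. */
static bool sketch_kref_put(struct sketch_kref *kr)
{
	if (atomic_load_explicit(&kr->refcount,
				 memory_order_relaxed) == 1 ||
	    atomic_fetch_sub(&kr->refcount, 1) == 1) {
		/* Last reference: cleanup would run here. */
		atomic_store(&kr->refcount, 0);
		return true;
	}
	return false;
}
```

Note that atomic_fetch_sub() returns the *previous* value, so a return value of 1 corresponds to the decrement-to-zero case tested by atomic_dec_and_test() in the kernel.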
Quick Quiz .5:
Why can't the check for a zero reference count be
made in a simple ``if'' statement with an atomic
increment in its ``then'' clause?
Answer:
Suppose that the ``if'' condition completed, finding
the reference counter value equal to one.
Suppose that a release operation executes, decrementing
the reference counter to zero and therefore starting
cleanup operations.
But now the ``then'' clause can increment the counter
back to a value of one, allowing the object to be
used after it has been cleaned up.
Quick Quiz .6:
But doesn't seqlock also permit readers and updaters to get
work done concurrently?
Answer:
Yes and no.
Although seqlock readers can run concurrently with
seqlock writers, whenever this happens, the read_seqretry()
primitive will force the reader to retry.
This means that any work done by a seqlock reader running concurrently
with a seqlock updater will be discarded and redone.
So seqlock readers can run concurrently with updaters,
but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in presence of concurrent RCU updaters.
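The retry behavior can be seen in a toy single-writer seqcount sketch (this is not the kernel's seqlock_t, and the function names are hypothetical): the reader's work is simply discarded and redone whenever the sequence number changes.

```c
#include <stdatomic.h>

static atomic_uint seq;       /* even: idle; odd: writer active */
static int protected_data;

static unsigned read_seqbegin_sketch(void)
{
	unsigned s;

	while ((s = atomic_load(&seq)) & 1)
		;                 /* writer active: wait */
	return s;
}

static int read_seqretry_sketch(unsigned start)
{
	/* Nonzero means an updater ran: discard and redo. */
	return atomic_load(&seq) != start;
}

static int seq_read_sketch(void)
{
	unsigned start;
	int val;

	do {
		start = read_seqbegin_sketch();
		val = protected_data;   /* "work" done in the section */
	} while (read_seqretry_sketch(start));
	return val;
}
```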
Quick Quiz .7:
What prevents the list_for_each_entry_rcu() from
getting a segfault if it happens to execute at exactly the same
time as the list_add_rcu()?
Answer:
On all systems running Linux, loads from and stores
to pointers are atomic, that is, if a store to a pointer occurs at
the same time as a load from that same pointer, the load will return
either the initial value or the value stored, never some bitwise
mashup of the two.
In addition, the list_for_each_entry_rcu() always proceeds
forward through the list, never looking back.
Therefore, the list_for_each_entry_rcu() will either see
the element being added by list_add_rcu() or it will not,
but either way, it will see a valid well-formed list.
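The argument can be sketched in userspace C with C11 atomics standing in for the kernel primitives (add_sketch() and sum_sketch() are hypothetical names, not the list API): the publisher initializes the node fully before the single atomic pointer store, so a concurrent traversal sees either the old head or the fully formed new node, never a bitwise mashup.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
	int data;
	struct node *next;
};

static _Atomic(struct node *) head;

/* Publisher: initialize the node completely, then store the
 * pointer with release ordering (the role list_add_rcu() plays
 * via rcu_assign_pointer()). */
static void add_sketch(struct node *n, int data)
{
	n->data = data;
	n->next = atomic_load(&head);
	atomic_store_explicit(&head, n, memory_order_release);
}

/* Traversal: each pointer load is atomic, and we only ever move
 * forward, so we see a valid, well-formed list. */
static int sum_sketch(void)
{
	struct node *p;
	int sum = 0;

	for (p = atomic_load_explicit(&head, memory_order_consume);
	     p != NULL; p = p->next)
		sum += p->data;
	return sum;
}
```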
Quick Quiz .8:
Why do we need to pass two pointers into
hlist_for_each_entry_rcu()
when only one is needed for list_for_each_entry_rcu()?
Answer:
Because in an hlist it is necessary to check for
NULL rather than for encountering the head.
(Try coding up a single-pointer hlist_for_each_entry_rcu()!
If you come up with a nice solution, it would be a very good thing!)
Quick Quiz .9:
How would you modify the deletion example to permit more than two
versions of the list to be active?
Answer:
One way of accomplishing this is as shown in
Figure .
Note that this means that multiple concurrent deletions might be waiting in synchronize_rcu().
Quick Quiz .10:
How many RCU versions of a given list can be
active at any given time?
Answer:
That depends on the synchronization design.
If a semaphore protecting the update is held across the grace period,
then there can be at most two versions, the old and the new.
However, if only the search, the update, and the list_replace_rcu() were protected by a lock, then there could be an arbitrary number of versions active, limited only by memory and by how many updates could be completed within a grace period. But please note that data structures that are updated so frequently probably are not good candidates for RCU. That said, RCU can handle high update rates when necessary.
Quick Quiz .11:
How can RCU updaters possibly delay RCU readers, given that the
rcu_read_lock() and rcu_read_unlock()
primitives neither spin nor block?
Answer:
The modifications undertaken by a given RCU updater will cause the
corresponding CPU to invalidate cache lines containing the data,
forcing the CPUs running concurrent RCU readers to incur expensive
cache misses.
(Can you design an algorithm that changes a data structure
without
inflicting expensive cache misses on concurrent readers?
On subsequent readers?)
Quick Quiz .12:
WTF?
How the heck do you expect me to believe that RCU has a
100-femtosecond overhead when the clock period at 3GHz is more than
300 picoseconds?
Answer:
First, consider that the inner loop used to
take this measurement is as follows:
  for (i = 0; i < CSCOUNT_SCALE; i++) {
    rcu_read_lock();
    rcu_read_unlock();
  }
Next, consider the effective definitions of rcu_read_lock() and rcu_read_unlock():
  #define rcu_read_lock() do { } while (0)
  #define rcu_read_unlock() do { } while (0)
Consider also that the compiler does simple optimizations, allowing it to replace the loop with:
  i = CSCOUNT_SCALE;
So the "measurement" of 100 femtoseconds is simply the fixed overhead of the timing measurements divided by the number of passes through the inner loop containing the calls to rcu_read_lock() and rcu_read_unlock(). And therefore, this measurement really is in error, in fact, in error by an arbitrary number of orders of magnitude. As you can see from the definitions of rcu_read_lock() and rcu_read_unlock() above, the actual overhead is precisely zero.
It certainly is not every day that a timing measurement of 100 femtoseconds turns out to be an overestimate!
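The pitfall is easy to reproduce. With the empty macro definitions above, the loop body vanishes entirely, and the compiler is entitled to reduce the whole loop to a single assignment:

```c
#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

enum { CSCOUNT_SCALE = 100000 };

/* The compiler may replace this entire loop with
 * "i = CSCOUNT_SCALE", so dividing elapsed time by
 * CSCOUNT_SCALE measures only the fixed overhead of the
 * timing machinery, not the (zero) read-side overhead. */
static long empty_cs_loop(void)
{
	long i;

	for (i = 0; i < CSCOUNT_SCALE; i++) {
		rcu_read_lock();
		rcu_read_unlock();
	}
	return i;
}
```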
Quick Quiz .13:
Why do both the variability and overhead of rwlock decrease as the
critical-section overhead increases?
Answer:
Because the contention on the underlying
rwlock_t decreases as the critical-section overhead
increases.
However, the rwlock overhead will not quite drop to that on a single
CPU because of cache-thrashing overhead.
Quick Quiz .14:
Is there an exception to this deadlock immunity, and if so,
what sequence of events could lead to deadlock?
Answer:
One way to cause a deadlock cycle involving
RCU read-side primitives is via the following (illegal) sequence
of statements:
  idx = srcu_read_lock(&srcucb);
  synchronize_srcu(&srcucb);
  srcu_read_unlock(&srcucb, idx);
The synchronize_srcu() cannot return until all pre-existing SRCU read-side critical sections complete, but it is enclosed in an SRCU read-side critical section that cannot complete until the synchronize_srcu() returns. The result is a classic self-deadlock: you get the same effect when attempting to write-acquire a reader-writer lock while read-holding it.
Note that this self-deadlock scenario does not apply to RCU Classic, because the context switch performed by the synchronize_rcu() would act as a quiescent state for this CPU, allowing a grace period to complete. However, this is if anything even worse, because data used by the RCU read-side critical section might be freed as a result of the grace period completing.
In short, do not invoke synchronous RCU update-side primitives from within an RCU read-side critical section.
Quick Quiz .15:
But wait!
This is exactly the same code that might be used when thinking
of RCU as a replacement for reader-writer locking!
What gives?
Answer:
This is an effect of the Law of Toy Examples:
beyond a certain point, the code fragments look the same.
The only difference is in how we think about the code.
However, this difference can be extremely important.
For but one example of the importance, consider that if we think
of RCU as a restricted reference counting scheme, we would never
be fooled into thinking that the updates would exclude the RCU
read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement for reader-writer locking, for example, when you are replacing reader-writer locking with RCU.
Quick Quiz .16:
Why the dip in refcnt overhead near 6 CPUs?
Answer:
Most likely NUMA effects.
However, there is substantial variance in the values measured for the
refcnt line, as can be seen by the error bars.
In fact, the standard deviations exceed 10% of the measured values in some cases.
The dip in overhead therefore might well be a statistical aberration.
Quick Quiz .17:
What if the element we need to delete is not the first element
of the list on line 9 of
Figure ?
Answer:
As with
Figure ,
this is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is again invited to adapt this example to a hash table with
full chaining.
Quick Quiz .18:
Why is it OK to exit the RCU read-side critical section on
line 15 of
Figure
before releasing the lock on line 17?
Answer:
First, please note that the second check on line 14 is
necessary because some other
CPU might have removed this element while we were waiting
to acquire the lock.
However, the fact that we were in an RCU read-side critical section
while acquiring the lock guarantees that this element could not
possibly have been re-allocated and re-inserted into this
hash table.
Furthermore, once we acquire the lock, the lock itself guarantees
the element's existence, so we no longer need to be in an
RCU read-side critical section.
The question as to whether it is necessary to re-check the element's key is left as an exercise to the reader.
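The lookup-lock-recheck pattern can be sketched as follows. This single-threaded sketch uses a pthread mutex for the per-element lock and no-op RCU primitives (as in non-preemptible kernel builds); the struct layout and function name are hypothetical, not taken from the figure.

```c
#include <pthread.h>
#include <stddef.h>

#define rcu_read_lock()   do { } while (0)
#define rcu_read_unlock() do { } while (0)

struct elem {
	pthread_mutex_t lock;
	int deleted;          /* set once removed from the table */
	int key;
};

/* Look up, lock, then re-check: the element may have been
 * deleted while we waited for its lock.  Exiting the RCU
 * read-side critical section before returning is safe because,
 * once the element has passed the re-check, the lock itself
 * guarantees its existence.  On success, returns the element
 * with its lock held. */
static struct elem *lookup_and_lock(struct elem *candidate, int key)
{
	rcu_read_lock();
	if (candidate == NULL || candidate->key != key) {
		rcu_read_unlock();
		return NULL;
	}
	pthread_mutex_lock(&candidate->lock);
	if (candidate->deleted) {          /* the second check */
		pthread_mutex_unlock(&candidate->lock);
		rcu_read_unlock();
		return NULL;
	}
	rcu_read_unlock();   /* lock now guarantees existence */
	return candidate;
}
```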
Quick Quiz .19:
Why not exit the RCU read-side critical section on
line 23 of
Figure
before releasing the lock on line 22?
Answer:
Suppose we reverse the order of these two lines.
Then this code is vulnerable to the following sequence of
events:
Quick Quiz .20:
But what if there is an arbitrarily long series of RCU
read-side critical sections in multiple threads, so that at
any point in time there is at least one thread in the system
executing in an RCU read-side critical section?
Wouldn't that prevent any data from a SLAB_DESTROY_BY_RCU
slab ever being returned to the system, possibly resulting
in OOM events?
Answer:
There could certainly be an arbitrarily long period of time
during which at least one thread is always in an RCU read-side
critical section.
However, the key words in the description in
Section
are ``in-use'' and ``pre-existing''.
Keep in mind that a given RCU read-side critical section is
conceptually only permitted to gain references to data elements
that were in use at the beginning of that critical section.
Furthermore, remember that a slab cannot be returned to the
system until all of its data elements have been freed, in fact,
the RCU grace period cannot start until after they have all been
freed.
Therefore, the slab cache need only wait for those RCU read-side critical sections that started before the freeing of the last element of the slab. This in turn means that any RCU grace period that begins after the freeing of the last element will do--the slab may be returned to the system after that grace period ends.
Quick Quiz .21:
Suppose that the nmi_profile() function was preemptible.
What would need to change to make this example work correctly?
Answer:
One approach would be to use
rcu_read_lock() and rcu_read_unlock()
in nmi_profile(), and to replace the
synchronize_sched() with synchronize_rcu(),
perhaps as shown in
Figure .
Quick Quiz .22:
Why do some of the cells in
Table
have exclamation marks (``!'')?
Answer:
The API members with exclamation marks (rcu_read_lock(),
rcu_read_unlock(), and call_rcu()) were the
only members of the Linux RCU API that Paul E. McKenney was aware
of back in the mid-90s.
During this timeframe, he was under the mistaken impression that
he knew all that there is to know about RCU.
Quick Quiz .23:
How do you prevent a huge number of RCU read-side critical
sections from indefinitely blocking a synchronize_rcu()
invocation?
Answer:
There is no need to do anything to prevent RCU read-side
critical sections from indefinitely blocking a
synchronize_rcu() invocation, because the
synchronize_rcu() invocation need wait only for
pre-existing RCU read-side critical sections.
So as long as each RCU read-side critical section is
of finite duration, there should be no problem.
Quick Quiz .24:
The synchronize_rcu() API waits for all pre-existing
interrupt handlers to complete, right?
Answer:
Absolutely not!
And especially not when using preemptible RCU!
You instead want synchronize_irq().
Alternatively, you can place calls to rcu_read_lock()
and rcu_read_unlock() in the specific interrupt handlers that
you want synchronize_rcu() to wait for.
Quick Quiz .25:
What happens if you mix and match?
For example, suppose you use rcu_read_lock() and
rcu_read_unlock() to delimit RCU read-side critical
sections, but then use call_rcu_bh() to post an
RCU callback?
Answer:
If there happened to be no RCU read-side critical
sections delimited by rcu_read_lock_bh() and
rcu_read_unlock_bh() at the time call_rcu_bh()
was invoked, RCU would be within its rights to invoke the callback
immediately, possibly freeing a data structure still being used by
the RCU read-side critical section!
This is not merely a theoretical possibility: a long-running RCU
read-side critical section delimited by rcu_read_lock()
and rcu_read_unlock() is vulnerable to this failure mode.
This vulnerability disappears in -rt kernels, where RCU Classic and RCU BH both map onto a common implementation.
Quick Quiz .26:
Hardware interrupt handlers can be thought of as being
under the protection of an implicit rcu_read_lock_bh(),
right?
Answer:
Absolutely not!
And especially not when using preemptible RCU!
If you need to access ``rcu_bh''-protected data structures
in an interrupt handler, you need to provide explicit calls to
rcu_read_lock_bh() and rcu_read_unlock_bh().
Quick Quiz .27:
What happens if you mix and match RCU Classic and RCU Sched?
Answer:
In a non-PREEMPT or a PREEMPT kernel, mixing these
two works "by accident" because in those kernel builds, RCU Classic
and RCU Sched map to the same implementation.
However, this mixture is fatal in PREEMPT_RT builds using the -rt
patchset, due to the fact that Realtime RCU's read-side critical
sections can be preempted, which would permit
synchronize_sched() to return before the
RCU read-side critical section reached its rcu_read_unlock()
call.
This could in turn result in a data structure being freed before the
read-side critical section was finished with it,
which could in turn greatly increase the actuarial risk experienced
by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired by the need for preemptible RCU read-side critical sections.
Quick Quiz .28:
In general, you cannot rely on synchronize_sched() to
wait for all pre-existing interrupt handlers,
right?
Answer:
That is correct!
Because -rt Linux uses threaded interrupt handlers, there can
be context switches in the middle of an interrupt handler.
Because synchronize_sched() waits only until each
CPU has passed through a context switch, it can return
before a given interrupt handler completes.
If you need to wait for a given interrupt handler to complete, you should instead use synchronize_irq() or place explicit RCU read-side critical sections in the interrupt handlers that you wish to wait on.
Quick Quiz .29:
Why do both SRCU and QRCU lack asynchronous call_srcu()
or call_qrcu() interfaces?
Answer:
Given an asynchronous interface, a single task
could register an arbitrarily large number of SRCU or QRCU callbacks,
thereby consuming an arbitrarily large quantity of memory.
In contrast, given the current synchronous
synchronize_srcu() and synchronize_qrcu()
interfaces, a given task must finish waiting for a given grace period
before it can start waiting for the next one.
Quick Quiz .30:
Under what conditions can synchronize_srcu() be safely
used within an SRCU read-side critical section?
Answer:
In principle, you can use
synchronize_srcu() with a given srcu_struct
within an SRCU read-side critical section that uses some other
srcu_struct.
In practice, however, doing this is almost certainly a bad idea.
In particular, the code shown in
Figure
could still result in deadlock.
Quick Quiz .31:
Why doesn't list_del_rcu() poison both the next
and prev pointers?
Answer:
Poisoning the next pointer would interfere
with concurrent RCU readers, who must use this pointer.
However, RCU readers are forbidden from using the prev
pointer, so it may safely be poisoned.
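The asymmetry can be sketched with a minimal doubly-linked list (the poison constant below is a stand-in, not the kernel's LIST_POISON2 value): the next pointer of the removed element is left intact for the benefit of readers already traversing past it, while prev is poisoned.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical stand-in for the kernel's prev-pointer poison. */
#define LIST_POISON_PREV ((struct list_head *)0x122)

/* Unlink entry but leave entry->next intact: a concurrent RCU
 * reader already at this element may still need to follow it
 * forward.  Readers never follow ->prev, so poisoning it is
 * safe and catches buggy backward traversals. */
static void list_del_rcu_sketch(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->prev = LIST_POISON_PREV;
}
```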
Quick Quiz .32:
Normally, any pointer subject to rcu_dereference() must
always be updated using rcu_assign_pointer().
What is an exception to this rule?
Answer:
One such exception is when a multi-element linked
data structure is initialized as a unit while inaccessible to other
CPUs, and then a single rcu_assign_pointer() is used
to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use
rcu_assign_pointer(), though any such assignments that
happen after the structure is globally visible must use
rcu_assign_pointer().
However, unless this initialization code is on an impressively hot code-path, it is probably wise to use rcu_assign_pointer() anyway, even though it is in theory unnecessary. It is all too easy for a "minor" change to invalidate your cherished assumptions about the initialization happening privately.
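The exception can be sketched as follows, with a C11 release store modeling rcu_assign_pointer() (the names here are illustrative, not the kernel API): the structure is invisible to other CPUs until the final publishing store, so the initializing assignments may be plain stores.

```c
#include <stdatomic.h>
#include <stddef.h>

struct pair {
	int a, b;
};

static _Atomic(struct pair *) gp;  /* global RCU-protected pointer */

/* Because p is private until the final release store, the
 * initializing assignments need no ordering of their own; only
 * the single publishing store needs rcu_assign_pointer()
 * semantics (modeled here as memory_order_release). */
static void publish_sketch(struct pair *p)
{
	p->a = 1;   /* plain store: p is still private */
	p->b = 2;   /* plain store: p is still private */
	atomic_store_explicit(&gp, p, memory_order_release);
}
```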
Quick Quiz .33:
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
Answer:
It can sometimes be difficult for automated
code checkers such as ``sparse'' (or indeed for human beings) to
work out which type of RCU read-side critical section a given
RCU traversal primitive corresponds to.
For example, consider the code shown in
Figure .
Is the rcu_dereference() primitive in an RCU Classic or an RCU Sched critical section? What would you have to do to figure this out?
Quick Quiz .34:
Why wouldn't any deadlock in the RCU implementation in
Figure
also be a deadlock in any other RCU implementation?
Answer:
Suppose the functions foo() and bar() in
Figure
are invoked concurrently from different CPUs.
Then foo() will acquire my_lock() on line 3,
while bar() will acquire rcu_gp_lock on
line 13.
When foo() advances to line 4, it will attempt to
acquire rcu_gp_lock, which is held by bar().
Then when bar() advances to line 14, it will attempt
to acquire my_lock, which is held by foo().
Each function is then waiting for a lock that the other holds, a classic deadlock.
Other RCU implementations neither spin nor block in rcu_read_lock(), hence avoiding deadlocks.
Quick Quiz .35:
Why not simply use reader-writer locks in the RCU implementation
in
Figure
in order to allow RCU readers to proceed in parallel?
Answer:
One could in fact use reader-writer locks in this manner.
However, textbook reader-writer locks suffer from memory
contention, so that the RCU read-side critical sections would
need to be quite long to actually permit parallel
execution [McK03].
On the other hand, use of a reader-writer lock that is read-acquired in rcu_read_lock() would avoid the deadlock condition noted above.
Quick Quiz .36:
Wouldn't it be cleaner to acquire all the locks, and then
release them all in the loop from lines 15-18 of
Figure ?
After all, with this change, there would be a point in time
when there were no readers, simplifying things greatly.
Answer:
Making this change would re-introduce the deadlock, so
no, it would not be cleaner.
Quick Quiz .37:
Is the implementation shown in
Figure
free from deadlocks?
Why or why not?
Answer:
One deadlock is where a lock is
held across synchronize_rcu(), and that same lock is
acquired within an RCU read-side critical section.
However, this situation will deadlock any correctly designed
RCU implementation.
After all, the synchronize_rcu() primitive must wait for all
pre-existing RCU read-side critical sections to complete,
but if one of those critical sections is spinning on a lock
held by the thread executing the synchronize_rcu(),
we have a deadlock inherent in the definition of RCU.
Another deadlock happens when attempting to nest RCU read-side critical sections. This deadlock is peculiar to this implementation, and might be avoided by using recursive locks, or by using reader-writer locks that are read-acquired by rcu_read_lock() and write-acquired by synchronize_rcu().
However, if we exclude the above two cases, this implementation of RCU does not introduce any deadlock situations. This is because the only time one thread acquires another thread's lock is while executing synchronize_rcu(), and in that case the lock is immediately released, prohibiting any deadlock cycle that does not involve a lock held across synchronize_rcu(), which is the first case above.
Quick Quiz .38:
Isn't one advantage of the RCU algorithm shown in
Figure
that it uses only primitives that are widely available,
for example, in POSIX pthreads?
Answer:
This is indeed an advantage, but do not forget that
rcu_dereference() and rcu_assign_pointer()
are still required, which means volatile manipulation
for rcu_dereference() and memory barriers for
rcu_assign_pointer().
Of course, many Alpha CPUs require memory barriers for both
primitives.
Quick Quiz .39:
But what if you hold a lock across a call to
synchronize_rcu(), and then acquire that same lock within
an RCU read-side critical section?
Answer:
Indeed, this would deadlock any legal RCU implementation.
But is rcu_read_lock() really participating in
the deadlock cycle?
If you believe that it is, then please
ask yourself this same question when looking at the
RCU implementation in
Section .
Quick Quiz .40:
How can the grace period possibly elapse in 40 nanoseconds when
synchronize_rcu() contains a 10-millisecond delay?
Answer:
The update-side test was run in absence of readers, so the
poll() system call was never invoked.
In addition, the actual code has this poll()
system call commented out, the better to evaluate the
true overhead of the update-side code.
Any production uses of this code would be better served by
using the poll() system call, but then again,
production uses would be even better served by other implementations
shown later in this section.
Quick Quiz .41:
Why not simply make rcu_read_lock() wait when a concurrent
synchronize_rcu() has been waiting too long in
the RCU implementation in
Figure ?
Wouldn't that prevent synchronize_rcu() from starving?
Answer:
Although this would in fact eliminate the starvation, it would
also mean that rcu_read_lock() would spin or block waiting
for the writer, which is in turn waiting on readers.
If one of these readers is attempting to acquire a lock that
the spinning/blocking rcu_read_lock() holds, we again
have deadlock.
In short, the cure is worse than the disease.
See Section
for a proper cure.
Quick Quiz .42:
Why the memory barrier on line 5 of synchronize_rcu() in
Figure
given that there is a spin-lock acquisition immediately after?
Answer:
The spin-lock acquisition only guarantees that the spin-lock's
critical section will not ``bleed out'' to precede the
acquisition.
It in no way guarantees that code preceding the spin-lock
acquisition won't be reordered into the critical section.
Such reordering could cause a removal from an RCU-protected
list to be reordered to follow the complementing of
rcu_idx, which could allow a newly starting RCU
read-side critical section to see the recently removed
data element.
Exercise for the reader: use a tool such as Promela/spin
to determine which (if any) of the memory barriers in
Figure
are really needed.
See Section
for information on using these tools.
The first correct and complete response will be credited.
Quick Quiz .43:
Why is the counter flipped twice in
Figure ?
Shouldn't a single flip-and-wait cycle be sufficient?
Answer:
Both flips are absolutely required.
To see this, consider the following sequence of events:
Exercise for the reader: What happens if rcu_read_lock() is preempted for a very long time (hours!) just after line 8? Does this implementation operate correctly in that case? Why or why not? The first correct and complete response will be credited.
Quick Quiz .44:
Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line 10 and a
non-atomic decrement on line 25 of
Figure ?
Answer:
Using non-atomic operations would cause increments and decrements
to be lost, in turn causing the implementation to fail.
See Section
for a safe way to use non-atomic operations in
rcu_read_lock() and rcu_read_unlock().
Quick Quiz .45:
Come off it!
We can see the atomic_read() primitive in
rcu_read_lock()!!!
So why are you trying to pretend that rcu_read_lock()
contains no atomic operations???
Answer:
The atomic_read() primitive does not actually execute
atomic machine instructions, but rather does a normal load
from an atomic_t.
Its sole purpose is to keep the compiler's type-checking happy.
If the Linux kernel ran on 8-bit CPUs, it would also need to
prevent ``store tearing'', which could happen due to the need
to store a 16-bit pointer with two eight-bit accesses on some
8-bit systems.
But thankfully, it seems that no one runs Linux on 8-bit systems.
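The effective behavior of atomic_read() can be sketched in a few lines (the names are stand-ins for the kernel's atomic_t and atomic_read()): a plain volatile load from the counter field, with no atomic machine instruction in sight.

```c
/* Stand-in for the kernel's atomic_t. */
typedef struct {
	int counter;
} atomic_t_sketch;

/* A plain (volatile) load: no atomic machine instruction, just
 * a load that keeps the compiler's type-checking happy and
 * prevents the access from being optimized away. */
#define atomic_read_sketch(v) (*(volatile int *)&(v)->counter)
```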
Quick Quiz .46:
Great, if we have N threads, we can have 2N ten-millisecond
waits (one set per flip_counter_and_wait() invocation),
and even that assumes that we wait only once for each thread.
Don't we need the grace period to complete much more quickly?
Answer:
Keep in mind that we only wait for a given thread if that thread
is still in a pre-existing RCU read-side critical section,
and that waiting for one hold-out thread gives all the other
threads a chance to complete any pre-existing RCU read-side
critical sections that they might still be executing.
So the only way that we would wait for the full 2N intervals
would be if the last thread still remained in a pre-existing
RCU read-side critical section despite all the waiting for
all the prior threads.
In short, this implementation will not wait unnecessarily.
However, if you are stress-testing code that uses RCU, you might want to comment out the poll() statement in order to better catch bugs that incorrectly retain a reference to an RCU-protected data element outside of an RCU read-side critical section.
Quick Quiz .47:
All of these toy RCU implementations have either atomic operations
in rcu_read_lock() and rcu_read_unlock(),
or synchronize_rcu()
overhead that increases linearly with the number of threads.
Under what circumstances could an RCU implementation enjoy
light-weight implementations for all three of these primitives,
all having deterministic (O(1)) overheads and latencies?
Answer:
Special-purpose uniprocessor implementations of RCU can attain
this ideal [McK09a].
Quick Quiz .48:
If any even value is sufficient to tell synchronize_rcu()
to ignore a given task, why doesn't line 10 of
Figure
simply assign zero to rcu_reader_gp?
Answer:
Assigning zero (or any other even-numbered constant)
would in fact work, but assigning the value of
rcu_gp_ctr can provide a valuable debugging aid,
as it gives the developer an idea of when the corresponding
thread last exited an RCU read-side critical section.
Quick Quiz .49:
Why are the memory barriers on lines 17 and 29 of
Figure
needed?
Aren't the memory barriers inherent in the locking
primitives on lines 18 and 28 sufficient?
Answer:
These memory barriers are required because the locking
primitives are only guaranteed to confine the critical
section.
The locking primitives are under absolutely no obligation
to keep other code from bleeding in to the critical section.
The pair of memory barriers is therefore required to prevent
this sort of code motion, whether performed by the compiler
or by the CPU.
Quick Quiz .50:
Couldn't the update-side optimization described in
Section
be applied to the implementation shown in
Figure
?
Answer:
Indeed it could, with a few modifications.
This work is left as an exercise for the reader.
Quick Quiz .51:
Is the possibility of readers being preempted in
line 3 of Figure
a real problem, in other words, is there a real sequence
of events that could lead to failure?
If not, why not?
If so, what is the sequence of events, and how can the
failure be addressed?
Answer:
It is a real problem, there is a sequence of events leading to
failure, and there are a number of possible ways of
addressing it.
For more details, see the Quick Quizzes near the end of
Section .
The reason for locating the discussion there is to (1) give you
more time to think about it, and (2) because the nesting support
added in that section greatly reduces the time required to
overflow the counter.
Quick Quiz .52:
Why not simply maintain a separate per-thread nesting-level
variable, as was done in the previous section, rather than having
all this complicated bit manipulation?
Answer:
The apparent simplicity of the separate per-thread variable
is a red herring.
This approach incurs much greater complexity in the guise
of careful ordering of operations, especially if signal
handlers are to be permitted to contain RCU read-side
critical sections.
But don't take my word for it, code it up and see what you
end up with!
Quick Quiz .53:
Given the algorithm shown in
Figure ,
how could you double the time required to overflow the global
rcu_gp_ctr?
Answer:
One way would be to replace the magnitude comparison on
lines 33 and 34 with an inequality check of the per-thread
rcu_reader_gp variable against
rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT.
Quick Quiz .54:
Again, given the algorithm shown in
Figure ,
is counter overflow fatal?
Why or why not?
If it is fatal, what can be done to fix it?
Answer:
It can indeed be fatal.
To see this, consider the following sequence of events:
Note that this scenario can also occur in the implementation presented in
Section .
One strategy for fixing this problem is to use 64-bit counters so that the time required to overflow them would exceed the useful lifetime of the computer system. Note that non-antique members of the 32-bit x86 CPU family allow atomic manipulation of 64-bit counters via the cmpxchg8b instruction.
Another strategy is to limit the rate at which grace periods are permitted to occur in order to achieve a similar effect. For example, synchronize_rcu() could record the last time that it was invoked, and any subsequent invocation would then check this time and block as needed to force the desired spacing. For example, if the low-order four bits of the counter were reserved for nesting, and if grace periods were permitted to occur at most ten times per second, then it would take more than 300 days for the counter to overflow. However, this approach is not helpful if there is any possibility that the system will be fully loaded with CPU-bound high-priority real-time threads for the full 300 days. (A remote possibility, perhaps, but best to consider it ahead of time.)
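The 300-day figure can be checked directly; the four nesting bits and the ten-grace-periods-per-second rate are the assumptions stated above:

```c
/* With a 32-bit counter whose low 4 bits are reserved for
 * nesting, the grace-period count occupies 28 bits, so overflow
 * requires 2^28 grace periods.  At ten grace periods per
 * second, that takes just over 300 days. */
static long overflow_days(void)
{
	long grace_periods = 1L << 28;        /* 268,435,456 */
	long per_day = 10L * 60 * 60 * 24;    /* 864,000 per day */

	return grace_periods / per_day;
}
```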
A third approach is to administratively abolish real-time threads from the system in question. In this case, the preempted process will age up in priority, thus getting to run long before the counter had a chance to overflow. Of course, this approach is less than helpful for real-time applications.
A final approach would be for rcu_read_lock() to recheck the value of the global rcu_gp_ctr after storing to its per-thread rcu_reader_gp counter, retrying if the new value of the global rcu_gp_ctr is inappropriate. This works, but introduces non-deterministic execution time into rcu_read_lock(). On the other hand, if your application is being preempted long enough for the counter to overflow, you have no hope of deterministic execution time in any case!
Quick Quiz .55:
Doesn't the additional memory barrier shown on line 14 of
Figure
greatly increase the overhead of rcu_quiescent_state()?
Answer:
Indeed it does!
An application using this implementation of RCU should therefore
invoke rcu_quiescent_state() sparingly, instead using
rcu_read_lock() and rcu_read_unlock() most of the
time.
However, this memory barrier is absolutely required so that other threads will see the store on lines 12-13 before any subsequent RCU read-side critical sections executed by the caller.
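The role of that barrier can be sketched as below. This is a simplified stand-in for the figure, not the book's code; the quoted "line" comments map this sketch onto the line numbers discussed above, and the seq_cst fences are a conservative assumption:

```c
#include <stdatomic.h>

static atomic_ulong rcu_gp_ctr = 2;		/* global grace-period counter */
static _Thread_local atomic_ulong rcu_reader_gp;	/* per-thread counter */

/* Hypothetical sketch of rcu_quiescent_state(): copy the global
 * counter into the per-thread counter (the store on "lines 12-13"),
 * then execute the costly "line 14" barrier so that other threads see
 * that store before any subsequent read-side critical sections. */
static void rcu_quiescent_state(void)
{
	atomic_thread_fence(memory_order_seq_cst);	/* order prior reads */
	atomic_store(&rcu_reader_gp,
		     atomic_load(&rcu_gp_ctr));		/* "lines 12-13" */
	atomic_thread_fence(memory_order_seq_cst);	/* "line 14" barrier */
}
```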
Quick Quiz .56:
Why are the two memory barriers on lines 19 and 22 of
Figure
needed?
Answer:
The memory barrier on line 19 ensures that any RCU read-side
critical sections that might precede the
call to rcu_thread_offline() won't be reordered by either
the compiler or the CPU to follow the assignment on lines 20-21.
The memory barrier on line 22 is, strictly speaking, unnecessary,
as it is illegal to have any RCU read-side critical sections
following the call to rcu_thread_offline().
Quick Quiz .57:
To be sure, the clock frequencies of ca-2008 Power
systems were quite high, but even a 5GHz clock
frequency is insufficient to allow
loops to be executed in 50 picoseconds!
What is going on here?
Answer:
Since the measurement loop contains a pair of empty functions,
the compiler optimizes it away.
The measurement loop takes 1,000 passes between each call to
rcu_quiescent_state(), so this measurement is roughly
one thousandth of the overhead of a single call to
rcu_quiescent_state().
Quick Quiz .58:
Why would the fact that the code is in a library make
any difference for how easy it is to use the RCU
implementation shown in
Figures and
?
Answer:
A library function has absolutely no control over the caller,
and thus cannot force the caller to invoke rcu_quiescent_state()
periodically.
On the other hand, a library function that made many references
to a given RCU-protected data structure might be able to invoke
rcu_thread_online() upon entry,
rcu_quiescent_state() periodically, and
rcu_thread_offline() upon exit.
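The library-function pattern just described might look like the following sketch. The function lib_scan() and the 1024-element interval are hypothetical, and the RCU primitives are replaced by no-op stubs so the sketch is self-contained:

```c
static int qs_count;	/* counts quiescent states, for illustration only */

/* No-op stand-ins for the QSBR primitives named in the text. */
static void rcu_thread_online(void) { }
static void rcu_thread_offline(void) { }
static void rcu_read_lock(void) { }
static void rcu_read_unlock(void) { }
static void rcu_quiescent_state(void) { qs_count++; }

/* Hypothetical library function: announce itself online on entry,
 * make its many references to RCU-protected data, pass through a
 * quiescent state every 1024 elements, and announce itself offline
 * on exit. */
static void lib_scan(int *tbl, int n)
{
	int i;

	rcu_thread_online();
	for (i = 0; i < n; i++) {
		rcu_read_lock();
		(void)tbl[i];		/* access RCU-protected data */
		rcu_read_unlock();
		if ((i & 0x3ff) == 0x3ff)
			rcu_quiescent_state();
	}
	rcu_thread_offline();
}
```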
Quick Quiz .59:
But what if you hold a lock across a call to
synchronize_rcu(), and then acquire that same lock within
an RCU read-side critical section?
This should be a deadlock, but how can a primitive that
generates absolutely no code possibly participate in a
deadlock cycle?
Answer:
Please note that the RCU read-side critical section is in
effect extended beyond the enclosing
rcu_read_lock() and rcu_read_unlock(), out to
the previous and next call to rcu_quiescent_state().
This rcu_quiescent_state() can be thought of as an
rcu_read_unlock() immediately followed by an
rcu_read_lock().
Even so, the actual deadlock itself will involve the lock acquisition in the RCU read-side critical section and the synchronize_rcu(), never the rcu_quiescent_state().
Quick Quiz .60:
Given that grace periods are prohibited within RCU read-side
critical sections, how can an RCU data structure possibly be
updated while in an RCU read-side critical section?
Answer:
This situation is one reason for the existence of asynchronous
grace-period primitives such as call_rcu().
This primitive may be invoked within an RCU read-side critical
section, and the specified RCU callback will in turn be invoked
at a later time, after a grace period has elapsed.
The ability to perform an RCU update while within an RCU read-side critical section can be extremely convenient, and is analogous to a (mythical) unconditional read-to-write upgrade for reader-writer locking.
Quick Quiz .61:
The statistical-counter implementation shown in
Figure
(count_end.c)
used a global lock to guard the summation in read_count(),
which resulted in poor performance and negative scalability.
How could you use RCU to provide read_count() with
excellent performance and good scalability?
(Keep in mind that read_count()'s scalability will
necessarily be limited by its need to scan all threads'
counters.)
Answer:
Hint: place the global variable finalcount and the
array counterp[] into a single RCU-protected struct.
At initialization time, this structure would be allocated
and set to all zero and NULL.
The inc_count() function would be unchanged.
The read_count() function would use rcu_read_lock() instead of acquiring final_mutex, and would need to use rcu_dereference() to acquire a reference to the current structure.
The count_register_thread() function would set the array element corresponding to the newly created thread to reference that thread's per-thread counter variable.
The count_unregister_thread() function would need to allocate a new structure, acquire final_mutex, copy the old structure to the new one, add the outgoing thread's counter variable to the total, NULL the pointer to this same counter variable, use rcu_assign_pointer() to install the new structure in place of the old one, release final_mutex, wait for a grace period, and finally free the old structure.
Does this really work? Why or why not?
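One way to sketch the count_unregister_thread() steps listed above is shown below. This is a single-threaded illustration, not the book's count_end_rcu.c: the struct layout follows the hint, and the RCU primitives are stubbed out so the sketch runs without a real RCU library:

```c
#include <stdlib.h>
#include <pthread.h>

#define NR_THREADS 4

/* Single-threaded stubs standing in for the real RCU primitives. */
#define rcu_assign_pointer(p, v) ((p) = (v))
static void synchronize_rcu(void) { }	/* no concurrent readers here */

struct countarray {
	unsigned long total;			/* the old global finalcount */
	unsigned long *counterp[NR_THREADS];	/* per-thread counter pointers */
};

static struct countarray *countarrayp;		/* RCU-protected pointer */
static pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Sketch of the unregister path described in the answer: allocate a
 * new structure, copy the old one under final_mutex, fold in the
 * outgoing thread's count, NULL its counter pointer, publish the new
 * structure, then wait a grace period before freeing the old one. */
static void count_unregister_thread(int idx)
{
	struct countarray *oldp;
	struct countarray *newp = malloc(sizeof(*newp));

	pthread_mutex_lock(&final_mutex);
	oldp = countarrayp;
	*newp = *oldp;				/* copy old structure */
	newp->total += *newp->counterp[idx];	/* fold in outgoing count */
	newp->counterp[idx] = NULL;		/* NULL the counter pointer */
	rcu_assign_pointer(countarrayp, newp);	/* publish new structure */
	pthread_mutex_unlock(&final_mutex);
	synchronize_rcu();			/* wait for pre-existing readers */
	free(oldp);				/* then free the old structure */
}
```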
Quick Quiz .62:
Section
showed a fanciful pair of code fragments that dealt with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.
How would you use RCU to provide excellent performance and
scalability?
(Keep in mind that the performance of the common-case first
code fragment that does I/O accesses is much more important
than that of the device-removal code fragment.)
Answer:
Hint: replace the read-acquisitions of the reader-writer lock
with RCU read-side critical sections, then adjust the
device-removal code fragment to suit.
See
Section
on
Page
for one solution to this problem.
Paul E. McKenney 2011-12-16