F.7 Chapter [*]

Quick Quiz [*].1: 
Why not implement reference-acquisition using a simple compare-and-swap operation that only acquires a reference if the reference counter is non-zero?
 
Answer:
Although this can resolve the race between the release of the last reference and acquisition of a new reference, it does absolutely nothing to prevent the data structure from being freed and reallocated, possibly as some completely different type of structure. It is quite likely that the ``simple compare-and-swap operation'' would give undefined results if applied to the differently typed structure.

In short, use of atomic operations such as compare-and-swap absolutely requires either type-safety or existence guarantees.
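For concreteness, the ``simple compare-and-swap operation'' in question might look like the following user-level sketch (names hypothetical). It does close the race against release of the last reference, but it cannot help once the enclosing structure has been freed and reallocated:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sketch: acquire a reference only if the counter is
 * still non-zero.  This closes the race against release of the last
 * reference, but does nothing to prevent the enclosing structure
 * from being freed and reallocated out from under us. */
static bool try_get_ref(atomic_int *refcnt)
{
	int old = atomic_load(refcnt);

	do {
		if (old == 0)
			return false;	/* Last reference already released. */
	} while (!atomic_compare_exchange_weak(refcnt, &old, old + 1));
	return true;
}
```

Without type-safety or an existence guarantee, the memory at `refcnt` might no longer be a reference counter at all by the time the compare-and-swap executes.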

Quick Quiz [*].2: 
Why isn't it necessary to guard against cases where one CPU acquires a reference just after another CPU releases the last reference?
 
Answer:
Because a CPU must already hold a reference in order to legally acquire another reference. Therefore, if one CPU releases the last reference, there cannot possibly be any CPU that is permitted to acquire a new reference. This same fact allows the non-atomic check on line 22 of Figure [*].

Quick Quiz [*].3: 
If the check on line 22 of Figure [*] fails, how could the check on line 23 possibly succeed?
 
Answer:
Suppose that kref_put() is protected by RCU, so that two CPUs might be executing line 22 concurrently. Both might see the value ``2'', causing both to then execute line 23. One of the two instances of atomic_dec_and_test() will decrement the value to zero and thus return 1.

Quick Quiz [*].4: 
How can it possibly be safe to non-atomically check for equality with ``1'' on line 22 of Figure [*]?
 
Answer:
Remember that it is not legal to call either kref_get() or kref_put() unless you hold a reference. If the reference count is equal to ``1'', then there can't possibly be another CPU authorized to change the value of the reference count.

Quick Quiz [*].5: 
Why can't the check for a zero reference count be made in a simple ``if'' statement with an atomic increment in its ``then'' clause?
 
Answer:
Suppose that the ``if'' condition completed, finding the reference counter value equal to one. Suppose that a release operation executes, decrementing the reference counter to zero and therefore starting cleanup operations. But now the ``then'' clause can increment the counter back to a value of one, allowing the object to be used after it has been cleaned up.
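The failure can be seen even in a single-threaded toy model (all names hypothetical) that forces the release to run in the window between the check and the increment:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the broken check-then-increment.  The simulated
 * release is interleaved at the worst possible moment, standing in
 * for a concurrently executing CPU. */
struct toy_obj {
	atomic_int refcnt;
	bool cleaned_up;
};

static void broken_get(struct toy_obj *p)
{
	if (atomic_load(&p->refcnt) != 0) {		/* "if": sees 1 */
		/* Simulated concurrent release of the last reference: */
		if (atomic_fetch_sub(&p->refcnt, 1) == 1)
			p->cleaned_up = true;		/* cleanup runs */
		atomic_fetch_add(&p->refcnt, 1);	/* "then" clause */
	}
}
```

After broken_get() on an object holding a single reference, the counter is back to one even though cleanup has already run, which is exactly the use-after-cleanup described above.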

Quick Quiz [*].6: 
But doesn't seqlock also permit readers and updaters to get work done concurrently?
 
Answer:
Yes and no. Although seqlock readers can run concurrently with seqlock writers, whenever this happens, the read_seqretry() primitive will force the reader to retry. This means that any work done by a seqlock reader running concurrently with a seqlock updater will be discarded and redone. So seqlock readers can run concurrently with updaters, but they cannot actually get any work done in this case.

In contrast, RCU readers can perform useful work even in the presence of concurrent RCU updaters.
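A user-level toy seqlock (names hypothetical, single writer, memory barriers omitted) makes the retry behavior concrete:

```c
#include <stdatomic.h>

/* Minimal single-writer seqlock sketch: the sequence counter is odd
 * while a write is in progress, and readers discard and redo their
 * work whenever the counter changed underneath them. */
static atomic_uint toy_seq;
static int toy_a, toy_b;

static unsigned toy_read_seqbegin(void)
{
	unsigned s;

	while ((s = atomic_load(&toy_seq)) & 1)
		continue;		/* writer active: wait */
	return s;
}

static int toy_read_seqretry(unsigned start)
{
	return atomic_load(&toy_seq) != start;	/* nonzero: redo the work */
}

static void toy_write(int a, int b)
{
	atomic_fetch_add(&toy_seq, 1);	/* counter now odd  */
	toy_a = a;
	toy_b = b;
	atomic_fetch_add(&toy_seq, 1);	/* counter even again */
}
```

A reader runs `do { s = toy_read_seqbegin(); a = toy_a; b = toy_b; } while (toy_read_seqretry(s));`, so any pass that overlapped toy_write() is thrown away, whereas an RCU reader would simply have kept whichever consistent snapshot it saw.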

Quick Quiz [*].7: 
What prevents the list_for_each_entry_rcu() from getting a segfault if it happens to execute at exactly the same time as the list_add_rcu()?
 
Answer:
On all systems running Linux, loads from and stores to pointers are atomic, that is, if a store to a pointer occurs at the same time as a load from that same pointer, the load will return either the initial value or the value stored, never some bitwise mashup of the two. In addition, the list_for_each_entry_rcu() always proceeds forward through the list, never looking back. Therefore, the list_for_each_entry_rcu() will either see the element being added by list_add_rcu() or it will not, but either way, it will see a valid well-formed list.
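The initialize-then-publish idea can be modeled at user level with C11 atomics (names hypothetical): the reader's pointer load returns either NULL or the fully initialized node, never a partial value.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy model of the guarantee described above. */
struct toy_node {
	int key;
	struct toy_node *next;
};

static _Atomic(struct toy_node *) toy_head;

static void toy_add(struct toy_node *n, int key)
{
	n->key = key;			/* initialize first ...   */
	n->next = NULL;
	atomic_store_explicit(&toy_head, n,
			      memory_order_release);	/* ... publish second */
}

static struct toy_node *toy_first(void)
{
	return atomic_load_explicit(&toy_head, memory_order_acquire);
}
```

The release/acquire pair plays the role of rcu_assign_pointer() and rcu_dereference() in the kernel API.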

Quick Quiz [*].8: 
Why do we need to pass two pointers into hlist_for_each_entry_rcu() when only one is needed for list_for_each_entry_rcu()?
 
Answer:
Because in an hlist it is necessary to check for NULL rather than for encountering the head. (Try coding up a single-pointer hlist_for_each_entry_rcu()! If you come up with a nice solution, it would be a very good thing!)

Quick Quiz [*].9: 
How would you modify the deletion example to permit more than two versions of the list to be active?
 
Answer:
One way of accomplishing this is as shown in Figure [*].

Figure: Concurrent RCU Deletion
 1 spin_lock(&mylock);
 2 p = searc...
   ...&mylock);
 8 synchronize_rcu();
 9 kfree(p);
10 }

(Listing truncated in this extraction.)

Note that this means that multiple concurrent deletions might be waiting in synchronize_rcu().
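Since the listing above is truncated, here is a hedged kernel-style sketch of the pattern it illustrates (element type and search() are assumed): the lock is dropped before synchronize_rcu(), so any number of deleters can be waiting out a grace period at the same time.

```c
/* Sketch only: not the original listing. */
int delete(int key)
{
	struct element *p;

	spin_lock(&mylock);
	p = search(key);
	if (p == NULL) {
		spin_unlock(&mylock);
		return 0;
	}
	list_del_rcu(&p->list);
	spin_unlock(&mylock);	/* other deleters may now proceed */
	synchronize_rcu();	/* many deleters can wait here at once */
	kfree(p);
	return 1;
}
```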

Quick Quiz [*].10: 
How many RCU versions of a given list can be active at any given time?
 
Answer:
That depends on the synchronization design. If a semaphore protecting the update is held across the grace period, then there can be at most two versions, the old and the new.

However, if only the search, the update, and the list_replace_rcu() were protected by a lock, then there could be an arbitrary number of versions active, limited only by memory and by how many updates could be completed within a grace period. But please note that data structures that are updated so frequently probably are not good candidates for RCU. That said, RCU can handle high update rates when necessary.

Quick Quiz [*].11: 
How can RCU updaters possibly delay RCU readers, given that the rcu_read_lock() and rcu_read_unlock() primitives neither spin nor block?
 
Answer:
The modifications undertaken by a given RCU updater will cause the corresponding CPU to invalidate cache lines containing the data, forcing the CPUs running concurrent RCU readers to incur expensive cache misses. (Can you design an algorithm that changes a data structure without inflicting expensive cache misses on concurrent readers? On subsequent readers?)

Quick Quiz [*].12: 
WTF? How the heck do you expect me to believe that RCU has a 100-femtosecond overhead when the clock period at 3GHz is more than 300 picoseconds?
 
Answer:
First, consider that the inner loop used to take this measurement is as follows:



  1 for (i = 0; i < CSCOUNT_SCALE; i++) {
  2   rcu_read_lock();
  3   rcu_read_unlock();
  4 }


Next, consider the effective definitions of rcu_read_lock() and rcu_read_unlock():



  1 #define rcu_read_lock()   do { } while (0)
  2 #define rcu_read_unlock() do { } while (0)


Consider also that the compiler does simple optimizations, allowing it to replace the loop with:



i = CSCOUNT_SCALE;


So the "measurement" of 100 femtoseconds is simply the fixed overhead of the timing measurements divided by the number of passes through the inner loop containing the calls to rcu_read_lock() and rcu_read_unlock(). And therefore, this measurement really is in error, in fact, in error by an arbitrary number of orders of magnitude. As you can see by the definition of rcu_read_lock() and rcu_read_unlock() above, the actual overhead is precisely zero.

It certainly is not every day that a timing measurement of 100 femtoseconds turns out to be an overestimate!

Quick Quiz [*].13: 
Why do both the variability and the overhead of rwlock decrease as the critical-section overhead increases?
 
Answer:
Because the contention on the underlying rwlock_t decreases as the critical-section overhead increases. However, the rwlock overhead will not quite drop to that on a single CPU because of cache-thrashing overhead.

Quick Quiz [*].14: 
Is there an exception to this deadlock immunity, and if so, what sequence of events could lead to deadlock?
 
Answer:
One way to cause a deadlock cycle involving RCU read-side primitives is via the following (illegal) sequence of statements:



idx = srcu_read_lock(&srcucb);
synchronize_srcu(&srcucb);
srcu_read_unlock(&srcucb, idx);


The synchronize_srcu() cannot return until all pre-existing SRCU read-side critical sections complete, but it is enclosed in an SRCU read-side critical section that cannot complete until the synchronize_srcu() returns. The result is a classic self-deadlock: you get the same effect when attempting to write-acquire a reader-writer lock while read-holding it.

Note that this self-deadlock scenario does not apply to RCU Classic, because the context switch performed by the synchronize_rcu() would act as a quiescent state for this CPU, allowing a grace period to complete. However, this is if anything even worse, because data used by the RCU read-side critical section might be freed as a result of the grace period completing.

In short, do not invoke synchronous RCU update-side primitives from within an RCU read-side critical section.

Quick Quiz [*].15: 
But wait! This is exactly the same code that might be used when thinking of RCU as a replacement for reader-writer locking! What gives?
 
Answer:
This is an effect of the Law of Toy Examples: beyond a certain point, the code fragments look the same. The only difference is in how we think about the code. However, this difference can be extremely important. For but one example of the importance, consider that if we think of RCU as a restricted reference counting scheme, we would never be fooled into thinking that the updates would exclude the RCU read-side critical sections.

It nevertheless is often useful to think of RCU as a replacement for reader-writer locking, for example, when you are replacing reader-writer locking with RCU.

Quick Quiz [*].16: 
Why the dip in refcnt overhead near 6 CPUs?
 
Answer:
Most likely NUMA effects. However, there is substantial variance in the values measured for the refcnt line, as can be seen by the error bars. In fact, the standard deviation exceeds 10% of the measured values in some cases. The dip in overhead therefore might well be a statistical aberration.

Quick Quiz [*].17: 
What if the element we need to delete is not the first element of the list on line 9 of Figure [*]?
 
Answer:
As with Figure [*], this is a very simple hash table with no chaining, so the only element in a given bucket is the first element. The reader is again invited to adapt this example to a hash table with full chaining.

Quick Quiz [*].18: 
Why is it OK to exit the RCU read-side critical section on line 15 of Figure [*] before releasing the lock on line 17?
 
Answer:
First, please note that the second check on line 14 is necessary because some other CPU might have removed this element while we were waiting to acquire the lock. However, the fact that we were in an RCU read-side critical section while acquiring the lock guarantees that this element could not possibly have been re-allocated and re-inserted into this hash table. Furthermore, once we acquire the lock, the lock itself guarantees the element's existence, so we no longer need to be in an RCU read-side critical section.

The question as to whether it is necessary to re-check the element's key is left as an exercise to the reader.

Quick Quiz [*].19: 
Why not exit the RCU read-side critical section on line 23 of Figure [*] before releasing the lock on line 22?
 
Answer:
Suppose we reverse the order of these two lines. Then this code is vulnerable to the following sequence of events:

  1. CPU 0 invokes delete(), and finds the element to be deleted, executing through line 15. It has not yet actually deleted the element, but is about to do so.
  2. CPU 1 concurrently invokes delete(), attempting to delete this same element. However, CPU 0 still holds the lock, so CPU 1 waits for it at line 13.
  3. CPU 0 executes lines 16 and 17, and blocks at line 18 waiting for CPU 1 to exit its RCU read-side critical section.
  4. CPU 1 now acquires the lock, but the test on line 14 fails because CPU 0 has already removed the element. CPU 1 now executes line 22 (which we switched with line 23 for the purposes of this Quick Quiz) and exits its RCU read-side critical section.
  5. CPU 0 can now return from synchronize_rcu(), and thus executes line 19, sending the element to the freelist.
  6. CPU 1 now attempts to release a lock for an element that has been freed, and, worse yet, possibly reallocated as some other type of data structure. This is a fatal memory-corruption error.

Quick Quiz [*].20: 
But what if there is an arbitrarily long series of RCU read-side critical sections in multiple threads, so that at any point in time there is at least one thread in the system executing in an RCU read-side critical section? Wouldn't that prevent any data from a SLAB_DESTROY_BY_RCU slab ever being returned to the system, possibly resulting in OOM events?
 
Answer:
There could certainly be an arbitrarily long period of time during which at least one thread is always in an RCU read-side critical section. However, the key words in the description in Section [*] are ``in-use'' and ``pre-existing''. Keep in mind that a given RCU read-side critical section is conceptually only permitted to gain references to data elements that were in use at the beginning of that critical section. Furthermore, remember that a slab cannot be returned to the system until all of its data elements have been freed, in fact, the RCU grace period cannot start until after they have all been freed.

Therefore, the slab cache need only wait for those RCU read-side critical sections that started before the freeing of the last element of the slab. This in turn means that any RCU grace period that begins after the freeing of the last element will do: the slab may be returned to the system after that grace period ends.

Quick Quiz [*].21: 
Suppose that the nmi_profile() function was preemptible. What would need to change to make this example work correctly?
 
Answer:
One approach would be to use rcu_read_lock() and rcu_read_unlock() in nmi_profile(), and to replace the synchronize_sched() with synchronize_rcu(), perhaps as shown in Figure [*].

Figure: Using RCU to Wait for Mythical Preemptible NMIs to Finish
 1 struct profile_buffer {
 2 l...
   ... NULL);
32 synchronize_rcu();
33 kfree(p);
34 }

(Listing truncated in this extraction.)

Quick Quiz [*].22: 
Why do some of the cells in Table [*] have exclamation marks (``!'')?
 
Answer:
The API members with exclamation marks (rcu_read_lock(), rcu_read_unlock(), and call_rcu()) were the only members of the Linux RCU API that Paul E. McKenney was aware of back in the mid-90s. During this timeframe, he was under the mistaken impression that he knew all that there is to know about RCU.

Quick Quiz [*].23: 
How do you prevent a huge number of RCU read-side critical sections from indefinitely blocking a synchronize_rcu() invocation?
 
Answer:
There is no need to do anything to prevent RCU read-side critical sections from indefinitely blocking a synchronize_rcu() invocation, because the synchronize_rcu() invocation need wait only for pre-existing RCU read-side critical sections. So as long as each RCU read-side critical section is of finite duration, there should be no problem.

Quick Quiz [*].24: 
The synchronize_rcu() API waits for all pre-existing interrupt handlers to complete, right?
 
Answer:
Absolutely not! And especially not when using preemptible RCU! You instead want synchronize_irq(). Alternatively, you can place calls to rcu_read_lock() and rcu_read_unlock() in the specific interrupt handlers that you want synchronize_rcu() to wait for.

Quick Quiz [*].25: 
What happens if you mix and match? For example, suppose you use rcu_read_lock() and rcu_read_unlock() to delimit RCU read-side critical sections, but then use call_rcu_bh() to post an RCU callback?
 
Answer:
If there happened to be no RCU read-side critical sections delimited by rcu_read_lock_bh() and rcu_read_unlock_bh() at the time call_rcu_bh() was invoked, RCU would be within its rights to invoke the callback immediately, possibly freeing a data structure still being used by the RCU read-side critical section! This is not merely a theoretical possibility: a long-running RCU read-side critical section delimited by rcu_read_lock() and rcu_read_unlock() is vulnerable to this failure mode.

This vulnerability disappears in -rt kernels, where RCU Classic and RCU BH both map onto a common implementation.

Quick Quiz [*].26: 
Hardware interrupt handlers can be thought of as being under the protection of an implicit rcu_read_lock_bh(), right?
 
Answer:
Absolutely not! And especially not when using preemptible RCU! If you need to access ``rcu_bh''-protected data structures in an interrupt handler, you need to provide explicit calls to rcu_read_lock_bh() and rcu_read_unlock_bh().

Quick Quiz [*].27: 
What happens if you mix and match RCU Classic and RCU Sched?
 
Answer:
In a non-PREEMPT or a PREEMPT kernel, mixing these two works "by accident" because in those kernel builds, RCU Classic and RCU Sched map to the same implementation. However, this mixture is fatal in PREEMPT_RT builds using the -rt patchset, due to the fact that Realtime RCU's read-side critical sections can be preempted, which would permit synchronize_sched() to return before the RCU read-side critical section reached its rcu_read_unlock() call. This could in turn result in a data structure being freed before the read-side critical section was finished with it, which could in turn greatly increase the actuarial risk experienced by your kernel.

In fact, the split between RCU Classic and RCU Sched was inspired by the need for preemptible RCU read-side critical sections.

Quick Quiz [*].28: 
In general, you cannot rely on synchronize_sched() to wait for all pre-existing interrupt handlers, right?
 
Answer:
That is correct! Because -rt Linux uses threaded interrupt handlers, there can be context switches in the middle of an interrupt handler. Because synchronize_sched() waits only until each CPU has passed through a context switch, it can return before a given interrupt handler completes.

If you need to wait for a given interrupt handler to complete, you should instead use synchronize_irq() or place explicit RCU read-side critical sections in the interrupt handlers that you wish to wait on.

Quick Quiz [*].29: 
Why do both SRCU and QRCU lack asynchronous call_srcu() or call_qrcu() interfaces?
 
Answer:
Given an asynchronous interface, a single task could register an arbitrarily large number of SRCU or QRCU callbacks, thereby consuming an arbitrarily large quantity of memory. In contrast, given the current synchronous synchronize_srcu() and synchronize_qrcu() interfaces, a given task must finish waiting for a given grace period before it can start waiting for the next one.

Quick Quiz [*].30: 
Under what conditions can synchronize_srcu() be safely used within an SRCU read-side critical section?
 
Answer:
In principle, you can use synchronize_srcu() with a given srcu_struct within an SRCU read-side critical section that uses some other srcu_struct. In practice, however, doing this is almost certainly a bad idea. In particular, the code shown in Figure [*] could still result in deadlock.

Figure: Multistage SRCU Deadlocks
1 idx = srcu_read_lock(&ssa);
2 s...
  ...ronize_srcu(&ssa);
9 srcu_read_unlock(&ssb, idx);

(Listing truncated in this extraction.)

Quick Quiz [*].31: 
Why doesn't list_del_rcu() poison both the next and prev pointers?
 
Answer:
Poisoning the next pointer would interfere with concurrent RCU readers, who must use this pointer. However, RCU readers are forbidden from using the prev pointer, so it may safely be poisoned.
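The asymmetry can be sketched as follows, after the style of the Linux-kernel list implementation (helper shape and poison constant assumed):

```c
/* Sketch: unlink the element, poisoning only the pointer that
 * concurrent RCU readers are forbidden to follow. */
static inline void toy_list_del_rcu(struct list_head *entry)
{
	entry->next->prev = entry->prev;	/* unlink ...             */
	entry->prev->next = entry->next;	/* ... from the list      */
	entry->prev = LIST_POISON2;		/* readers never use prev */
	/* entry->next is deliberately left intact: a reader currently
	 * on this element must still be able to advance past it. */
}
```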

Quick Quiz [*].32: 
Normally, any pointer subject to rcu_dereference() must always be updated using rcu_assign_pointer(). What is an exception to this rule?
 
Answer:
One such exception is when a multi-element linked data structure is initialized as a unit while inaccessible to other CPUs, and then a single rcu_assign_pointer() is used to plant a global pointer to this data structure. The initialization-time pointer assignments need not use rcu_assign_pointer(), though any such assignments that happen after the structure is globally visible must use rcu_assign_pointer().

However, unless this initialization code is on an impressively hot code-path, it is probably wise to use rcu_assign_pointer() anyway, even though it is in theory unnecessary. It is all too easy for a "minor" change to invalidate your cherished assumptions about the initialization happening privately.
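A hedged kernel-style sketch of the exception (structure, field, and pointer names assumed):

```c
/* Initialization is private to this CPU, so plain assignments
 * suffice until the single publishing store. */
struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

p->a = 1;			/* not yet globally visible:      */
p->b = 2;			/* no rcu_assign_pointer() needed */
rcu_assign_pointer(gp, p);	/* publish: ordering matters now  */

/* From here on, any assignment to a pointer reachable from gp must
 * again follow the usual rcu_assign_pointer() rules. */
```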

Quick Quiz [*].33: 
Are there any downsides to the fact that these traversal and update primitives can be used with any of the RCU API family members?
 
Answer:
It can sometimes be difficult for automated code checkers such as ``sparse'' (or indeed for human beings) to work out which type of RCU read-side critical section a given RCU traversal primitive corresponds to. For example, consider the code shown in Figure [*].

Figure: Diverse RCU Read-Side Nesting
1 rcu_read_lock();
2 preempt_disa...
  ...
6
7 preempt_enable();
8 rcu_read_unlock();

(Listing truncated in this extraction.)

Is the rcu_dereference() primitive in an RCU Classic or an RCU Sched critical section? What would you have to do to figure this out?

Quick Quiz [*].34: 
Why wouldn't any deadlock in the RCU implementation in Figure [*] also be a deadlock in any other RCU implementation?
 
Answer:

Figure: Deadlock in Lock-Based RCU Implementation
 1 void foo(void)
 2 {
 3 spin_loc...
   ...
17 do_whatever();
18 rcu_read_unlock();
19 }

(Listing truncated in this extraction.)

Suppose the functions foo() and bar() in Figure [*] are invoked concurrently from different CPUs. Then foo() will acquire my_lock on line 3, while bar() will acquire rcu_gp_lock on line 13. When foo() advances to line 4, it will attempt to acquire rcu_gp_lock, which is held by bar(). Then when bar() advances to line 14, it will attempt to acquire my_lock, which is held by foo().

Each function is then waiting for a lock that the other holds, a classic deadlock.

Other RCU implementations neither spin nor block in rcu_read_lock(), hence avoiding deadlocks.
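Because the listing above is truncated, here is a hedged sketch of the deadlock cycle, reconstructed from the description (it assumes that this toy implementation's rcu_read_lock() simply acquires the global rcu_gp_lock):

```c
/* Sketch only: not the original listing. */
void foo(void)
{
	spin_lock(&my_lock);	/* foo() holds my_lock ...            */
	rcu_read_lock();	/* ... and waits for rcu_gp_lock      */
	do_something();
	rcu_read_unlock();
	spin_unlock(&my_lock);
}

void bar(void)
{
	rcu_read_lock();	/* bar() holds rcu_gp_lock ...        */
	spin_lock(&my_lock);	/* ... and waits for my_lock: deadlock */
	do_something_else();
	spin_unlock(&my_lock);
	rcu_read_unlock();
}
```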

Quick Quiz [*].35: 
Why not simply use reader-writer locks in the RCU implementation in Figure [*] in order to allow RCU readers to proceed in parallel?
 
Answer:
One could in fact use reader-writer locks in this manner. However, textbook reader-writer locks suffer from memory contention, so that the RCU read-side critical sections would need to be quite long to actually permit parallel execution [McK03].

On the other hand, use of a reader-writer lock that is read-acquired in rcu_read_lock() would avoid the deadlock condition noted above.

Quick Quiz [*].36: 
Wouldn't it be cleaner to acquire all the locks, and then release them all in the loop from lines 15-18 of Figure [*]? After all, with this change, there would be a point in time when there were no readers, simplifying things greatly.
 
Answer:
Making this change would re-introduce the deadlock, so no, it would not be cleaner.

Quick Quiz [*].37: 
Is the implementation shown in Figure [*] free from deadlocks? Why or why not?
 
Answer:
One deadlock is where a lock is held across synchronize_rcu(), and that same lock is acquired within an RCU read-side critical section. However, this situation will deadlock any correctly designed RCU implementation. After all, the synchronize_rcu() primitive must wait for all pre-existing RCU read-side critical sections to complete, but if one of those critical sections is spinning on a lock held by the thread executing the synchronize_rcu(), we have a deadlock inherent in the definition of RCU.

Another deadlock happens when attempting to nest RCU read-side critical sections. This deadlock is peculiar to this implementation, and might be avoided by using recursive locks, or by using reader-writer locks that are read-acquired by rcu_read_lock() and write-acquired by synchronize_rcu().

However, if we exclude the above two cases, this implementation of RCU does not introduce any deadlock situations. This is because the only time that one thread acquires another thread's lock is while executing synchronize_rcu(), and in that case the lock is immediately released, prohibiting any deadlock cycle that does not involve a lock held across synchronize_rcu(), which is the first case above.

Quick Quiz [*].38: 
Isn't one advantage of the RCU algorithm shown in Figure [*] that it uses only primitives that are widely available, for example, in POSIX pthreads?
 
Answer:
This is indeed an advantage, but do not forget that rcu_dereference() and rcu_assign_pointer() are still required, which means volatile manipulation for rcu_dereference() and memory barriers for rcu_assign_pointer(). Of course, many Alpha CPUs require memory barriers for both primitives.

Quick Quiz [*].39: 
But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section?
 
Answer:
Indeed, this would deadlock any legal RCU implementation. But is rcu_read_lock() really participating in the deadlock cycle? If you believe that it is, then please ask yourself this same question when looking at the RCU implementation in Section [*].

Quick Quiz [*].40: 
How can the grace period possibly elapse in 40 nanoseconds when synchronize_rcu() contains a 10-millisecond delay?
 
Answer:
The update-side test was run in the absence of readers, so the poll() system call was never invoked. In addition, the actual code has this poll() system call commented out, the better to evaluate the true overhead of the update-side code. Any production uses of this code would be better served by using the poll() system call, but then again, production uses would be even better served by other implementations shown later in this section.

Quick Quiz [*].41: 
Why not simply make rcu_read_lock() wait when a concurrent synchronize_rcu() has been waiting too long in the RCU implementation in Figure [*]? Wouldn't that prevent synchronize_rcu() from starving?
 
Answer:
Although this would in fact eliminate the starvation, it would also mean that rcu_read_lock() would spin or block waiting for the writer, which is in turn waiting on readers. If one of these readers is attempting to acquire a lock that the spinning/blocking rcu_read_lock() holds, we again have deadlock.

In short, the cure is worse than the disease. See Section [*] for a proper cure.

Quick Quiz [*].42: 
Why the memory barrier on line 5 of synchronize_rcu() in Figure [*] given that there is a spin-lock acquisition immediately after?
 
Answer:
The spin-lock acquisition only guarantees that the spin-lock's critical section will not ``bleed out'' to precede the acquisition. It in no way guarantees that code preceding the spin-lock acquisition won't be reordered into the critical section. Such reordering could cause a removal from an RCU-protected list to be reordered to follow the complementing of rcu_idx, which could allow a newly starting RCU read-side critical section to see the recently removed data element.

Exercise for the reader: use a tool such as Promela/spin to determine which (if any) of the memory barriers in Figure [*] are really needed. See Section [*] for information on using these tools. The first correct and complete response will be credited.

Quick Quiz [*].43: 
Why is the counter flipped twice in Figure [*]? Shouldn't a single flip-and-wait cycle be sufficient?
 
Answer:
Both flips are absolutely required. To see this, consider the following sequence of events:

  1. Line 8 of rcu_read_lock() in Figure [*] picks up rcu_idx, finding its value to be zero.
  2. Line 8 of synchronize_rcu() in Figure [*] complements the value of rcu_idx, setting its value to one.
  3. Lines 10-13 of synchronize_rcu() find that the value of rcu_refcnt[0] is zero, and thus return. (Recall that the question is asking what happens if lines 14-20 are omitted.)
  4. Lines 9 and 10 of rcu_read_lock() store the value zero to this thread's instance of rcu_read_idx and increment rcu_refcnt[0], respectively. Execution then proceeds into the RCU read-side critical section.
  5. Another instance of synchronize_rcu() again complements rcu_idx, this time setting its value to zero. Because rcu_refcnt[1] is zero, synchronize_rcu() returns immediately. (Recall that rcu_read_lock() incremented rcu_refcnt[0], not rcu_refcnt[1]!)
  6. The grace period that started in step [*] has been allowed to end, despite the fact that the RCU read-side critical section that started beforehand in step [*] has not completed. This violates RCU semantics, and could allow the update to free a data element that the RCU read-side critical section was still referencing.

Exercise for the reader: What happens if rcu_read_lock() is preempted for a very long time (hours!) just after line 8? Does this implementation operate correctly in that case? Why or why not? The first correct and complete response will be credited.

Quick Quiz [*].44: 
Given that atomic increment and decrement are so expensive, why not just use non-atomic increment on line 10 and a non-atomic decrement on line 25 of Figure [*]?
 
Answer:
Using non-atomic operations would cause increments and decrements to be lost, in turn causing the implementation to fail. See Section [*] for a safe way to use non-atomic operations in rcu_read_lock() and rcu_read_unlock().

Quick Quiz [*].45: 
Come off it! We can see the atomic_read() primitive in rcu_read_lock()!!! So why are you trying to pretend that rcu_read_lock() contains no atomic operations???
 
Answer:
The atomic_read() primitive does not actually execute atomic machine instructions, but rather does a normal load from an atomic_t. Its sole purpose is to keep the compiler's type-checking happy. If the Linux kernel ran on 8-bit CPUs, it would also need to prevent ``store tearing'', which could happen due to the need to store a 16-bit pointer with two eight-bit accesses on some 8-bit systems. But thankfully, it seems that no one runs Linux on 8-bit systems.
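The point can be illustrated with a user-level analogue (names hypothetical):

```c
/* Toy analogue of atomic_t and atomic_read(): the read is an
 * ordinary load of the wrapped counter, and the wrapper struct
 * exists only so that the compiler complains about unwrapped
 * accesses. */
typedef struct {
	volatile int counter;
} toy_atomic_t;

#define toy_atomic_read(v)	((v)->counter)
```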

Quick Quiz [*].46: 
Great, if we have $N$ threads, we can have $2N$ ten-millisecond waits (one set per flip_counter_and_wait() invocation), and even that assumes that we wait only once for each thread. Don't we need the grace period to complete much more quickly?
 
Answer:
Keep in mind that we only wait for a given thread if that thread is still in a pre-existing RCU read-side critical section, and that waiting for one hold-out thread gives all the other threads a chance to complete any pre-existing RCU read-side critical sections that they might still be executing. So the only way that we would wait for $2N$ intervals would be if the last thread still remained in a pre-existing RCU read-side critical section despite all the waiting for all the prior threads. In short, this implementation will not wait unnecessarily.

However, if you are stress-testing code that uses RCU, you might want to comment out the poll() statement in order to better catch bugs that incorrectly retain a reference to an RCU-protected data element outside of an RCU read-side critical section.

Quick Quiz [*].47: 
All of these toy RCU implementations have either atomic operations in rcu_read_lock() and rcu_read_unlock(), or synchronize_rcu() overhead that increases linearly with the number of threads. Under what circumstances could an RCU implementation enjoy light-weight implementations for all three of these primitives, all having deterministic ( $O\left(1\right)$) overheads and latencies?
 
Answer:
Special-purpose uniprocessor implementations of RCU can attain this ideal [McK09a].

Quick Quiz [*].48: 
If any even value is sufficient to tell synchronize_rcu() to ignore a given task, why doesn't line 10 of Figure [*] simply assign zero to rcu_reader_gp?
 
Answer:
Assigning zero (or any other even-numbered constant) would in fact work, but assigning the value of rcu_gp_ctr can provide a valuable debugging aid, as it gives the developer an idea of when the corresponding thread last exited an RCU read-side critical section.

Quick Quiz [*].49: 
Why are the memory barriers on lines 17 and 29 of Figure [*] needed? Aren't the memory barriers inherent in the locking primitives on lines 18 and 28 sufficient?
 
Answer:
These memory barriers are required because the locking primitives are only guaranteed to confine the critical section. The locking primitives are under absolutely no obligation to keep other code from bleeding into the critical section. The pair of memory barriers is therefore required to prevent this sort of code motion, whether performed by the compiler or by the CPU.
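The barrier placement can be sketched as below. This is a hedged stand-in, not the figure's code: smp_mb() is modeled with GCC's __sync_synchronize() full barrier, and the lock is a minimal test-and-set spinlock. The lock keeps the critical section's accesses inside; only the explicit barriers keep the caller's earlier and later accesses from bleeding in.

```c
#include <assert.h>

#define smp_mb() __sync_synchronize()

static int rcu_gp_lock;

static void spin_lock(int *lp)   { while (__sync_lock_test_and_set(lp, 1)) ; }
static void spin_unlock(int *lp) { __sync_lock_release(lp); }

static int gp_count;

static void synchronize_rcu(void)
{
	smp_mb();			/* fence off the caller's prior accesses */
	spin_lock(&rcu_gp_lock);
	gp_count++;			/* stand-in for grace-period machinery */
	spin_unlock(&rcu_gp_lock);
	smp_mb();			/* fence off the caller's later accesses */
}
```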

Quick Quiz [*].50: 
Couldn't the update-side optimization described in Section [*] be applied to the implementation shown in Figure [*]?
 
Answer:
Indeed it could, with a few modifications. This work is left as an exercise for the reader.

Quick Quiz [*].51: 
Is the possibility of readers being preempted in line 3 of Figure [*] a real problem? In other words, is there a real sequence of events that could lead to failure? If not, why not? If so, what is the sequence of events, and how can the failure be addressed?
 
Answer:
It is a real problem, there is a sequence of events leading to failure, and there are a number of possible ways of addressing it. For more details, see the Quick Quizzes near the end of Section [*]. There are two reasons for locating the discussion there: (1) to give you more time to think about it, and (2) the nesting support added in that section greatly reduces the time required to overflow the counter.

Quick Quiz [*].52: 
Why not simply maintain a separate per-thread nesting-level variable, as was done in the previous section, rather than having all this complicated bit manipulation?
 
Answer:
The apparent simplicity of the separate per-thread variable is a red herring. This approach incurs much greater complexity in the guise of careful ordering of operations, especially if signal handlers are to be permitted to contain RCU read-side critical sections. But don't take my word for it, code it up and see what you end up with!

Quick Quiz [*].53: 
Given the algorithm shown in Figure [*], how could you double the time required to overflow the global rcu_gp_ctr?
 
Answer:
One way would be to replace the magnitude comparison on lines 33 and 34 with an inequality check of the per-thread rcu_reader_gp variable against rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT.

Quick Quiz [*].54: 
Again, given the algorithm shown in Figure [*], is counter overflow fatal? Why or why not? If it is fatal, what can be done to fix it?
 
Answer:
It can indeed be fatal. To see this, consider the following sequence of events:

  1. Thread 0 enters rcu_read_lock(), determines that it is not nested, and therefore fetches the value of the global rcu_gp_ctr. Thread 0 is then preempted for an extremely long time (before storing to its per-thread rcu_reader_gp variable).
  2. Other threads repeatedly invoke synchronize_rcu(), so that the new value of the global rcu_gp_ctr is now RCU_GP_CTR_BOTTOM_BIT less than it was when thread 0 fetched it.
  3. Thread 0 now starts running again, and stores into its per-thread rcu_reader_gp variable. The value it stores is RCU_GP_CTR_BOTTOM_BIT+1 greater than that of the global rcu_gp_ctr.
  4. Thread 0 acquires a reference to RCU-protected data element A.
  5. Thread 1 now removes the data element A that thread 0 just acquired a reference to.
  6. Thread 1 invokes synchronize_rcu(), which increments the global rcu_gp_ctr by RCU_GP_CTR_BOTTOM_BIT. It then checks all of the per-thread rcu_reader_gp variables, but thread 0's value (incorrectly) indicates that it started after thread 1's call to synchronize_rcu(), so thread 1 does not wait for thread 0 to complete its RCU read-side critical section.
  7. Thread 1 then frees up data element A, which thread 0 is still referencing.

Note that this scenario can also occur in the implementation presented in Section [*].

One strategy for fixing this problem is to use 64-bit counters so that the time required to overflow them would exceed the useful lifetime of the computer system. Note that non-antique members of the 32-bit x86 CPU family allow atomic manipulation of 64-bit counters via the cmpxchg8b instruction.

Another strategy is to limit the rate at which grace periods are permitted to occur in order to achieve a similar effect. For example, synchronize_rcu() could record the last time that it was invoked, and any subsequent invocation would then check this time and block as needed to force the desired spacing. For example, if the low-order four bits of the counter were reserved for nesting, and if grace periods were permitted to occur at most ten times per second, then it would take more than 300 days for the counter to overflow. However, this approach is not helpful if there is any possibility that the system will be fully loaded with CPU-bound high-priority real-time threads for the full 300 days. (A remote possibility, perhaps, but best to consider it ahead of time.)

A third approach is to administratively abolish real-time threads from the system in question. In this case, the preempted process will age up in priority, thus getting to run long before the counter has a chance to overflow. Of course, this approach is less than helpful for real-time applications.

A final approach would be for rcu_read_lock() to recheck the value of the global rcu_gp_ctr after storing to its per-thread rcu_reader_gp counter, retrying if the new value of the global rcu_gp_ctr is inappropriate. This works, but introduces non-deterministic execution time into rcu_read_lock(). On the other hand, if your application is being preempted long enough for the counter to overflow, you have no hope of deterministic execution time in any case!
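The recheck approach above can be sketched as follows. This is hypothetical code, not the figure's: after publishing the odd per-thread value, rcu_read_lock() re-reads the global counter and retries if a grace period slipped in between the two reads. The retry loop is precisely what makes the execution time non-deterministic.

```c
#include <assert.h>

#define smp_mb() __sync_synchronize()

static unsigned long rcu_gp_ctr = 2;
static __thread unsigned long rcu_reader_gp;

static void rcu_read_lock(void)
{
	unsigned long gp;

	for (;;) {
		gp = rcu_gp_ctr;		/* snapshot global counter */
		rcu_reader_gp = gp + 1;		/* odd: in critical section */
		smp_mb();
		if (rcu_gp_ctr == gp)
			break;			/* snapshot still current */
		/* A grace period intervened while we were preempted:
		 * retry with a fresh value of the global counter. */
	}
}
```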

Quick Quiz [*].55: 
Doesn't the additional memory barrier shown on line 14 of Figure [*] greatly increase the overhead of rcu_quiescent_state()?
 
Answer:
Indeed it does! An application using this implementation of RCU should therefore invoke rcu_quiescent_state() sparingly, instead using rcu_read_lock() and rcu_read_unlock() most of the time.

However, this memory barrier is absolutely required so that other threads will see the store on lines 12-13 before any subsequent RCU read-side critical sections executed by the caller.
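The required ordering can be sketched as below (a hedged stand-in for the figure's code, with smp_mb() modeled by GCC's __sync_synchronize()): the store announcing the momentary quiescent state must be ordered before any later read-side critical section, and the trailing barrier that enforces this is what makes the call expensive.

```c
#include <assert.h>

#define smp_mb() __sync_synchronize()

static unsigned long rcu_gp_ctr = 2;
static __thread unsigned long rcu_reader_gp;

static void rcu_quiescent_state(void)
{
	smp_mb();			/* order prior reads before the store */
	rcu_reader_gp = rcu_gp_ctr;	/* even value: momentary quiescence */
	smp_mb();			/* order the store before any later
					 * read-side critical section */
}
```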

Quick Quiz [*].56: 
Why are the two memory barriers on lines 19 and 22 of Figure [*] needed?
 
Answer:
The memory barrier on line 19 ensures that any RCU read-side critical sections that might precede the call to rcu_thread_offline() will not be reordered by either the compiler or the CPU to follow the assignment on lines 20-21. The memory barrier on line 22 is, strictly speaking, unnecessary, as it is illegal to have any RCU read-side critical sections following the call to rcu_thread_offline().

Quick Quiz [*].57: 
To be sure, the clock frequencies of ca-2008 Power systems were quite high, but even a 5GHz clock frequency is insufficient to allow loops to be executed in 50 picoseconds! What is going on here?
 
Answer:
Since the measurement loop contains a pair of empty functions, the compiler optimizes it away. The measurement loop takes 1,000 passes between each call to rcu_quiescent_state(), so this measurement is roughly one thousandth of the overhead of a single call to rcu_quiescent_state().

Quick Quiz [*].58: 
Why would the fact that the code is in a library make any difference for how easy it is to use the RCU implementation shown in Figures [*] and [*]?
 
Answer:
A library function has absolutely no control over the caller, and thus cannot force the caller to invoke rcu_quiescent_state() periodically. On the other hand, a library function that made many references to a given RCU-protected data structure might be able to invoke rcu_thread_online() upon entry, rcu_quiescent_state() periodically, and rcu_thread_offline() upon exit.
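The suggested library pattern can be sketched as follows, with all the RCU primitives stubbed out as counters (library_scan() and its periodicity are hypothetical): the long-running library function brings the thread online on entry, announces quiescent states periodically, and goes offline on exit, so that the caller need not be aware of RCU at all.

```c
#include <assert.h>

static int online_calls, qs_calls, offline_calls;

/* Stubs standing in for the real primitives. */
static void rcu_thread_online(void)   { online_calls++; }
static void rcu_quiescent_state(void) { qs_calls++; }
static void rcu_thread_offline(void)  { offline_calls++; }

static void library_scan(int nitems)
{
	rcu_thread_online();		/* on entry */
	for (int i = 0; i < nitems; i++) {
		/* rcu_read_lock(); ... access the structure ...
		 * rcu_read_unlock(); */
		if (i % 100 == 99)
			rcu_quiescent_state();	/* periodic announcement */
	}
	rcu_thread_offline();		/* on exit */
}
```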

Quick Quiz [*].59: 
But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section? This should be a deadlock, but how can a primitive that generates absolutely no code possibly participate in a deadlock cycle?
 
Answer:
Please note that the RCU read-side critical section is in effect extended beyond the enclosing rcu_read_lock() and rcu_read_unlock(), out to the previous and next call to rcu_quiescent_state(). This rcu_quiescent_state() can be thought of as an rcu_read_unlock() immediately followed by an rcu_read_lock().

Even so, the actual deadlock itself will involve the lock acquisition in the RCU read-side critical section and the synchronize_rcu(), never the rcu_quiescent_state().

Quick Quiz [*].60: 
Given that grace periods are prohibited within RCU read-side critical sections, how can an RCU data structure possibly be updated while in an RCU read-side critical section?
 
Answer:
This situation is one reason for the existence of asynchronous grace-period primitives such as call_rcu(). This primitive may be invoked within an RCU read-side critical section, and the specified RCU callback will in turn be invoked at a later time, after a grace period has elapsed.

The ability to perform an RCU update while within an RCU read-side critical section can be extremely convenient, and is analogous to a (mythical) unconditional read-to-write upgrade for reader-writer locking.
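The asynchronous pattern above can be sketched with a toy, single-threaded model. The queue below is a hypothetical stand-in for the real call_rcu() machinery, and run_callbacks() models "after a grace period has elapsed"; the point is that call_rcu() only *queues* the callback, which is why it is legal inside a read-side critical section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *);
};

static struct rcu_head *cb_list;

static void call_rcu(struct rcu_head *rhp, void (*func)(struct rcu_head *))
{
	rhp->func = func;
	rhp->next = cb_list;
	cb_list = rhp;		/* defer: no freeing inside the reader */
}

static void run_callbacks(void)	/* models "a grace period has elapsed" */
{
	while (cb_list) {
		struct rcu_head *rhp = cb_list;

		cb_list = rhp->next;
		rhp->func(rhp);
	}
}

struct foo {
	struct rcu_head rh;
	int data;
};

static int frees;

static void foo_reclaim(struct rcu_head *rhp)
{
	struct foo *fp = (struct foo *)((char *)rhp -
					offsetof(struct foo, rh));

	frees++;
	free(fp);
}
```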

Quick Quiz [*].61: 
The statistical-counter implementation shown in Figure [*] (count_end.c) used a global lock to guard the summation in read_count(), which resulted in poor performance and negative scalability. How could you use RCU to provide read_count() with excellent performance and good scalability? (Keep in mind that read_count()'s scalability will necessarily be limited by its need to scan all threads' counters.)
 
Answer:
Hint: place the global variable finalcount and the array counterp[] into a single RCU-protected struct. At initialization time, this structure would be allocated and set to all zero and NULL.

The inc_count() function would be unchanged.

The read_count() function would use rcu_read_lock() instead of acquiring final_mutex, and would need to use rcu_dereference() to acquire a reference to the current structure.

The count_register_thread() function would set the array element corresponding to the newly created thread to reference that thread's per-thread counter variable.

The count_unregister_thread() function would need to allocate a new structure, acquire final_mutex, copy the old structure to the new one, add the outgoing thread's counter variable to the total, NULL the pointer to this same counter variable, use rcu_assign_pointer() to install the new structure in place of the old one, release final_mutex, wait for a grace period, and finally free the old structure.

Does this really work? Why or why not?
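The hint's read side can be sketched as below, as a hedged single-threaded model: the RCU primitives are stubbed as no-ops, NR_THREADS is an assumed bound, and only read_count() is shown. The structure combines finalcount (here total, holding the counts of exited threads) and the counterp[] array, so that a reader always sees a consistent pair.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_THREADS 4

/* Stubs: in a real implementation these are the RCU primitives. */
#define rcu_read_lock()            do { } while (0)
#define rcu_read_unlock()          do { } while (0)
#define rcu_dereference(p)         (p)

struct countarray {
	unsigned long total;			/* counts of exited threads */
	unsigned long *counterp[NR_THREADS];	/* live per-thread counters */
};

static struct countarray *countarrayp;

static unsigned long read_count(void)
{
	unsigned long sum;
	struct countarray *cap;

	rcu_read_lock();			/* replaces final_mutex */
	cap = rcu_dereference(countarrayp);	/* consistent snapshot */
	sum = cap->total;
	for (int i = 0; i < NR_THREADS; i++)
		if (cap->counterp[i])
			sum += *cap->counterp[i];
	rcu_read_unlock();
	return sum;
}
```

Because count_unregister_thread() installs a *new* structure with rcu_assign_pointer() and frees the old one only after a grace period, a concurrent read_count() either sees the outgoing thread's counter in counterp[] or sees its value folded into total, but never both and never neither.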

Quick Quiz [*].62: 
Section [*] showed a fanciful pair of code fragments that dealt with counting I/O accesses to removable devices. These code fragments suffered from high overhead on the fastpath (starting an I/O) due to the need to acquire a reader-writer lock. How would you use RCU to provide excellent performance and scalability? (Keep in mind that the performance of the common-case first code fragment that does I/O accesses is much more important than that of the device-removal code fragment.)
 
Answer:
Hint: replace the read-acquisitions of the reader-writer lock with RCU read-side critical sections, then adjust the device-removal code fragment to suit.

See Section [*] on Page [*] for one solution to this problem.

Paul E. McKenney 2011-12-16