F.15 Chapter [*]

Quick Quiz [*].1: 
Why is sleeping prohibited within Classic RCU read-side critical sections?
 
Answer:
Because sleeping implies a context switch, which in Classic RCU is a quiescent state, and RCU's grace-period detection requires that quiescent states never appear in RCU read-side critical sections.
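
For concreteness, here is a sketch of the kind of reader that this rule forbids. The pointer gp, the structure foo, and the function do_something() are hypothetical; the point is that the msleep() forces a context switch, and hence a quiescent state, while the reader still holds a reference:

  struct foo {
          int data;
  };
  struct foo *gp;                 /* hypothetical RCU-protected pointer */

  void buggy_reader(void)
  {
          struct foo *p;

          rcu_read_lock();
          p = rcu_dereference(gp);
          msleep(10);             /* BUG: the resulting context switch is a
                                   * quiescent state in Classic RCU, so the
                                   * grace period may end and *p may be freed
                                   * while this reader still references it. */
          if (p != NULL)
                  do_something(p->data);
          rcu_read_unlock();
  }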

Quick Quiz [*].2: 
Why not permit sleeping in Classic RCU read-side critical sections by eliminating context switch as a quiescent state, leaving user-mode execution and idle loop as the remaining quiescent states?
 
Answer:
This would mean that a system undergoing heavy kernel-mode execution load (e.g., due to kernel threads) might never complete a grace period, which would cause it to exhaust memory sooner or later.

Quick Quiz [*].3: 
Why is it OK to assume that updates separated by synchronize_sched() will be performed in order?
 
Answer:
Because this property is required for the synchronize_sched() aspect of RCU to work at all. For example, consider a code sequence that removes an object from a list, invokes synchronize_sched(), then frees the object. If this property did not hold, then that object might appear to be freed before it was removed from the list, which is precisely the situation that synchronize_sched() is supposed to prevent!
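
As a concrete illustration of this remove-then-free pattern, consider the following sketch, which assumes a hypothetical list element struct foo on a standard Linux-kernel list guarded by a hypothetical list_lock:

  struct foo {
          struct list_head list;
          int data;
  };
  static DEFINE_SPINLOCK(list_lock);      /* hypothetical lock guarding the list */

  void remove_and_free(struct foo *p)
  {
          spin_lock(&list_lock);
          list_del_rcu(&p->list);         /* 1: unlink p from the list */
          spin_unlock(&list_lock);
          synchronize_sched();            /* 2: wait for pre-existing readers */
          kfree(p);                       /* 3: no reader can now hold a
                                           *    reference, so freeing is safe */
  }

If the unlink in step 1 and the free in step 3 could be reordered across the synchronize_sched() in step 2, a reader could be left holding a pointer into freed memory, which is precisely what this primitive exists to prevent.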

Quick Quiz [*].4: 
Why must line 17 in synchronize_srcu() (Figure [*]) precede the release of the mutex on line 18? What would have to change to permit these two lines to be interchanged? Would such a change be worthwhile? Why or why not?
 
Answer:
Suppose that the order was reversed, and that CPU 0 has just reached line 13 of synchronize_srcu(), while both CPU 1 and CPU 2 start executing another synchronize_srcu() each, and CPU 3 starts executing a srcu_read_lock(). Suppose that CPU 1 reaches line 6 of synchronize_srcu() just before CPU 0 increments the counter on line 13. Most importantly, suppose that CPU 3 executes srcu_read_lock() out of order with the following SRCU read-side critical section, so that it acquires a reference to some SRCU-protected data structure before CPU 0 increments sp->completed, but executes the srcu_read_lock() after CPU 0 does this increment.

Then CPU 0 will not wait for CPU 3 to complete its SRCU read-side critical section before exiting the ``while'' loop on lines 15-16 and releasing the mutex (remember, the CPU could be reordering the code).

Now suppose that CPU 2 acquires the mutex next, and again increments sp->completed. This CPU will then have to wait for CPU 3 to exit its SRCU read-side critical section before exiting the loop on lines 15-16 and releasing the mutex. But suppose that CPU 3 again executes out of order, completing the srcu_read_unlock() prior to executing a final reference to the pointer it obtained when entering the SRCU read-side critical section.

CPU 1 will then acquire the mutex, but see that the sp->completed counter has incremented twice, and therefore take the early exit. The caller might well free up the element that CPU 3 is still referencing (due to CPU 3's out-of-order execution).

To prevent this perhaps improbable, but entirely possible, scenario, the final synchronize_sched() must precede the mutex release in synchronize_srcu().

Another approach would be to change the comparison on line 7 of synchronize_srcu() to check for at least three increments of the counter. However, such a change would increase the latency of a ``bulk update'' scenario, where a hash table is being updated or unloaded using multiple threads. In the current code, the resulting concurrent synchronize_srcu() calls would complete within at most two SRCU grace periods, while with this change, three would be required.

More experience will be required to determine which approach is really better. For one thing, there must first be some use of SRCU with multiple concurrent updaters.
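
For readers unfamiliar with the SRCU API being discussed, the following minimal usage sketch may help; the srcu_struct named ss, the structure foo, and the remove_element() function are placeholders:

  static struct srcu_struct ss;   /* initialized via init_srcu_struct(&ss) */

  /* Reader: unlike Classic RCU readers, SRCU readers may block. */
  void srcu_reader(void)
  {
          int idx;

          idx = srcu_read_lock(&ss);
          /* ... access SRCU-protected data, possibly sleeping ... */
          srcu_read_unlock(&ss, idx);
  }

  /* Updater: remove the element, then wait for pre-existing SRCU readers. */
  void srcu_updater(struct foo *p)
  {
          remove_element(p);      /* hypothetical removal under suitable locking */
          synchronize_srcu(&ss);
          kfree(p);
  }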

Quick Quiz [*].5: 
Wait a minute! With all those new locks, how do you avoid deadlock?
 
Answer:
Deadlock is avoided by never holding more than one of the rcu_node structures' locks at a given time. This algorithm uses two more locks, one to prevent CPU-hotplug operations from running concurrently with grace-period advancement (onofflock) and another to permit only one CPU at a time to force quiescent states in order to end the grace period quickly (fqslock). These are subject to a locking hierarchy, so that fqslock must be acquired before onofflock, which in turn must be acquired before any of the rcu_node structures' locks.

Also, as a practical matter, refusing to ever hold more than one of the rcu_node locks means that it is unnecessary to track which ones are held. Such tracking would be painful as well as unnecessary.

Quick Quiz [*].6: 
Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead?
 
Answer:
RCU works with no problems on systems with a few hundred CPUs, so allowing 64 CPUs to contend on a single lock leaves plenty of headroom. Keep in mind that these locks are acquired quite rarely, as each CPU will check in about one time per grace period, and grace periods extend for milliseconds.

Quick Quiz [*].7: 
But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!!
 
Answer:
OK, have it your way, then! Set CONFIG_RCU_FANOUT=16 and (for NR_CPUS=4096) you will get a three-level hierarchy with 256 rcu_node structures at the lowest level, 16 rcu_node structures as intermediate nodes, and a single root-level rcu_node structure. The penalty you will pay is that more rcu_node structures will need to be scanned when checking to see which CPUs need help completing their quiescent states (256 instead of only 64).
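
The node counts quoted above can be verified with a quick back-of-the-envelope calculation. The following is ordinary user-level C rather than kernel code; it simply divides NR_CPUS by the fanout repeatedly:

  #include <stdio.h>

  int main(void)
  {
          int nodes = 4096;       /* NR_CPUS */
          int fanout = 16;        /* CONFIG_RCU_FANOUT */
          int level = 0;

          do {
                  nodes = (nodes + fanout - 1) / fanout;
                  printf("level %d: %d rcu_node structure(s)\n", level++, nodes);
          } while (nodes > 1);
          return 0;               /* prints 256, 16, and 1 */
  }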

Quick Quiz [*].8: 
OK, so what is the story with the colors?
 
Answer:
Data structures analogous to rcu_state (including rcu_ctrlblk) are yellow, those containing the bitmaps used to determine when CPUs have checked in are pink, and the per-CPU rcu_data structures are blue. The data structures used to conserve energy (such as rcu_dynticks) will be colored green.

Quick Quiz [*].9: 
Given such an egregious bug, why does Linux run at all?
 
Answer:
Because the Linux kernel contains device drivers that are (relatively) well behaved. Few if any of them spin in RCU read-side critical sections for the many milliseconds that would be required to provoke this bug. The bug nevertheless does need to be fixed, and this variant of RCU does fix it.

Quick Quiz [*].10: 
But doesn't this state diagram indicate that dyntick-idle CPUs will get hit with reschedule IPIs? Won't that wake them up?
 
Answer:
No. Keep in mind that RCU is handling groups of CPUs. One particular group might contain both dyntick-idle CPUs and CPUs in normal mode that have somehow managed to avoid passing through a quiescent state. Only the latter group will be sent a reschedule IPI; the dyntick-idle CPUs will merely be marked as being in an extended quiescent state.

Quick Quiz [*].11: 
But what happens if a CPU tries to report going through a quiescent state (by clearing its bit) before the bit-setting CPU has finished?
 
Answer:
There are three cases to consider here:

  1. A CPU corresponding to a not-yet-initialized leaf rcu_node structure tries to report a quiescent state. This CPU will see its bit already cleared, so will give up on reporting its quiescent state. Some later quiescent state will serve for the new grace period.
  2. A CPU corresponding to a leaf rcu_node structure that is currently being initialized tries to report a quiescent state. This CPU will see that the rcu_node structure's ->lock is held, so will spin until it is released. But once the lock is released, the rcu_node structure will have been initialized, reducing to the following case.
  3. A CPU corresponding to a leaf rcu_node that has already been initialized tries to report a quiescent state. This CPU will find its bit set, and will therefore clear it. If it is the last CPU for that leaf node, it will move up to the next level of the hierarchy. However, this CPU cannot possibly be the last CPU in the system to report a quiescent state, given that the CPU doing the initialization cannot yet have checked in.

So, in all three cases, the potential race is resolved correctly.

Quick Quiz [*].12: 
And what happens if all CPUs try to report going through a quiescent state before the bit-setting CPU has finished, thus ending the new grace period before it starts?
 
Answer:
The bit-setting CPU cannot pass through a quiescent state during initialization, as it has irqs disabled. Its bits therefore remain non-zero, preventing the grace period from ending until the data structure has been fully initialized.

Quick Quiz [*].13: 
And what happens if one CPU comes out of dyntick-idle mode and then passes through a quiescent state just as another CPU notices that the first CPU was in dyntick-idle mode? Couldn't they both attempt to report a quiescent state at the same time, resulting in confusion?
 
Answer:
They will both attempt to acquire the lock on the same leaf rcu_node structure. The first one to acquire the lock will report the quiescent state and clear the appropriate bit, and the second one to acquire the lock will see that this bit has already been cleared.

Quick Quiz [*].14: 
But what if all the CPUs end up in dyntick-idle mode? Wouldn't that prevent the current RCU grace period from ever ending?
 
Answer:
Indeed it will! However, CPUs that have RCU callbacks are not permitted to enter dyntick-idle mode, so the only way that all the CPUs could possibly end up in dyntick-idle mode would be if there were absolutely no RCU callbacks in the system. And if there are no RCU callbacks in the system, then there is no need for the RCU grace period to end. In fact, there is no need for the RCU grace period to even start.

RCU will restart if some irq handler does a call_rcu(), which will cause an RCU callback to appear on the corresponding CPU, which will force that CPU out of dyntick-idle mode, which will in turn permit the current RCU grace period to come to an end.

Quick Quiz [*].15: 
Given that force_quiescent_state() is a three-phase state machine, don't we have triple the scheduling latency due to scanning all the CPUs?
 
Answer:
Ah, but the three phases will not execute back-to-back on the same CPU, and, furthermore, the first (initialization) phase doesn't do any scanning. Therefore, the scheduling-latency hit of the three-phase algorithm is no different than that of a single-phase algorithm. If the scheduling latency becomes a problem, one approach would be to recode the state machine to scan the CPUs incrementally, most likely by keeping state on a per-leaf-rcu_node basis. But first show me a problem in the real world, then I will consider fixing it!

Quick Quiz [*].16: 
But the other reason to hold ->onofflock is to prevent multiple concurrent online/offline operations, right?
 
Answer:
Actually, no! The CPU-hotplug code's synchronization design prevents multiple concurrent CPU online/offline operations, so only one CPU online/offline operation can be executing at any given time. Therefore, the only purpose of ->onofflock is to prevent a CPU online or offline operation from running concurrently with grace-period initialization.

Quick Quiz [*].17: 
Given all these acquisitions of the global ->onofflock, won't there be horrible lock contention when running with thousands of CPUs?
 
Answer:
Actually, there can be only three acquisitions of this lock per grace period, and each grace period lasts many milliseconds. One of the acquisitions is by the CPU initializing the current grace period, and the other two are by CPUs coming online or going offline. These latter two cannot run concurrently due to the CPU-hotplug locking, so at most two CPUs can be contending for this lock at any given time.

Lock contention on ->onofflock should therefore be no problem, even on systems with thousands of CPUs.

Quick Quiz [*].18: 
Why not simplify the code by merging the detection of dyntick-idle CPUs with that of offline CPUs?
 
Answer:
Such merging might well eventually be the right thing to do. In the meantime, however, there are some challenges:

  1. CPUs are not allowed to go into dyntick-idle mode while they have RCU callbacks pending, but CPUs are allowed to go offline with callbacks pending. This means that CPUs going offline need to have their callbacks migrated to some other CPU; thus, we cannot allow CPUs to simply go quietly offline.
  2. Present-day Linux systems run with NR_CPUS much larger than the actual number of CPUs. A unified approach could thus end up uselessly waiting on CPUs that are not just offline, but which never existed in the first place.
  3. RCU is already operational when CPUs get onlined one at a time during boot, and therefore must handle the online process. This onlining must exclude grace-period initialization, so the ->onofflock must still be used.
  4. CPUs often switch into and out of dyntick-idle mode extremely frequently, so it is not reasonable to use the heavyweight online/offline code path for entering and exiting dyntick-idle mode.

Quick Quiz [*].19: 
Why not simply disable bottom halves (softirq) when acquiring the rcu_data structure's lock? Wouldn't this be faster?
 
Answer:
Because this lock can be acquired from functions called by call_rcu(), which in turn can be invoked from irq handlers. Therefore, irqs must be disabled when holding this lock.
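
In other words, the lock must be acquired with irqs disabled, not merely with bottom halves disabled. The following sketch illustrates the safe form; rdp->lock stands in for the lock discussed in this Quick Quiz, and the enqueue step is elided:

  static void rcu_enqueue_callback(struct rcu_data *rdp, struct rcu_head *head)
  {
          unsigned long flags;

          spin_lock_irqsave(&rdp->lock, flags);   /* irqs off: safe even though
                                                   * call_rcu() may run from a
                                                   * hardirq handler on this CPU */
          /* ... append head to rdp's callback list ... */
          spin_unlock_irqrestore(&rdp->lock, flags);
  }

Had this used spin_lock_bh() instead, a hardirq arriving while the lock was held could invoke call_rcu(), which would then attempt to acquire the same lock on the same CPU, resulting in self-deadlock.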

Quick Quiz [*].20: 
How about the qsmask and qsmaskinit fields for the leaf rcu_node structures? Doesn't there have to be some way to work out which of the bits in these fields corresponds to each CPU covered by the rcu_node structure in question?
 
Answer:
Indeed there does! The grpmask field in each CPU's rcu_data structure does this job.

Quick Quiz [*].21: 
But why bother setting qs_pending to one when a CPU is coming online, given that being offline is an extended quiescent state that should cover any ongoing grace period?
 
Answer:
Because this helps to resolve a race that can occur when a CPU comes online just as a new grace period is starting.

Quick Quiz [*].22: 
Why record the last completed grace period number in passed_quiesc_completed? Doesn't that cause this RCU implementation to be vulnerable to quiescent states seen while no grace period was in progress being incorrectly applied to the next grace period that starts?
 
Answer:
We record the last completed grace period number in order to avoid races where a quiescent state noted near the end of one grace period is incorrectly applied to the next grace period, especially for dyntick and CPU-offline grace periods. Therefore, force_quiescent_state() and friends all check the last completed grace period number to avoid such races.

Now, checks for dyntick-idle and CPU-offline quiescent states are carried out only when a grace period is actually active. The only quiescent states that can be recorded when no grace period is in progress are self-detected quiescent states, which are recorded in the passed_quiesc_completed, passed_quiesc, and qs_pending fields. These fields are initialized every time the corresponding CPU notices that a new grace period has started, preventing any obsolete quiescent states from being applied to the new grace period.

All that said, optimizing grace-period latency may require that gpnum be tracked in addition to completed.

Quick Quiz [*].23: 
What is the point of running a system with NR_CPUS way bigger than the actual number of CPUs?
 
Answer:
Because this allows producing a single binary of the Linux kernel that runs on a wide variety of systems, greatly easing administration and validation.

Quick Quiz [*].24: 
Why not simply have multiple lists rather than this funny multi-tailed list?
 
Answer:
Because this multi-tailed approach, due to Lai Jiangshan, simplifies callback processing.

Quick Quiz [*].25: 
So some poor CPU has to note quiescent states on behalf of each and every offline CPU? Yecch! Won't that result in excessive overheads in the not-uncommon case of a system with a small number of CPUs but a large value for NR_CPUS?
 
Answer:
Actually, no it will not!

Offline CPUs are excluded from both the qsmask and qsmaskinit bit masks, so RCU normally ignores them. However, there are races with online/offline operations that can result in an offline CPU having its qsmask bit set. These races must of course be handled correctly, and the way they are handled is to permit other CPUs to note that RCU is waiting on a quiescent state from an offline CPU.

Quick Quiz [*].26: 
So what guards the earlier fields in this structure?
 
Answer:
Nothing does, as they are constants set at compile time or boot time. Of course, the fields internal to each rcu_node in the ->node array may change, but they are guarded separately.

Quick Quiz [*].27: 
I thought that RCU read-side processing was supposed to be fast! The functions shown in Figure [*] have so much junk in them that they just have to be slow! What gives here?
 
Answer:
Appearances can be deceiving. The preempt_disable(), preempt_enable(), local_bh_disable(), and local_bh_enable() functions each do a single non-atomic manipulation of local data. Even that assumes CONFIG_PREEMPT; otherwise, the preempt_disable() and preempt_enable() functions emit no code, not even compiler directives. The __acquire() and __release() functions emit no code (not even compiler directives), but are instead used by the sparse semantic-parsing bug-finding program. Finally, rcu_read_acquire() and rcu_read_release() emit no code (not even compiler directives) unless the ``lockdep'' lock-order debugging facility is enabled, in which case they can indeed be somewhat expensive.

In short, unless you are a kernel hacker who has enabled debugging options, these functions are extremely cheap, and in some cases, absolutely free of overhead. And, in the words of a Portland-area furniture retailer, ``free is a very good price''.
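
For reference, the non-preemptible rcu_read_lock() of that era boils down to roughly the following (a sketch, not a verbatim copy of the kernel source):

  static inline void rcu_read_lock(void)
  {
          __rcu_read_lock();      /* preempt_disable(), which is a no-op
                                   * unless CONFIG_PREEMPT is set */
          __acquire(RCU);         /* sparse annotation: emits no code */
          rcu_read_acquire();     /* lockdep hook: no-op unless lockdep is enabled */
  }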

Quick Quiz [*].28: 
Why not simply use __get_cpu_var() to pick up a reference to the current CPU's rcu_data structure on line 13 in Figure [*]?
 
Answer:
Because we might be called either from call_rcu() (in which case we would need __get_cpu_var(rcu_data)) or from call_rcu_bh() (in which case we would need __get_cpu_var(rcu_bh_data)). Using the ->rda[] array of whichever rcu_state structure we were passed works correctly regardless of which API __call_rcu() was invoked from (suggested by Lai Jiangshan [Jia08]).
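
The following sketch shows the resulting structure; it follows this chapter's naming conventions (rcu_state, rcu_bh_state, ->rda[]) but is a simplified sketch rather than the verbatim kernel code:

  static void __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *),
                         struct rcu_state *rsp)
  {
          unsigned long flags;
          struct rcu_data *rdp;

          local_irq_save(flags);
          rdp = rsp->rda[smp_processor_id()];     /* right rcu_data for either flavor */
          /* ... initialize head with func and enqueue it on rdp ... */
          local_irq_restore(flags);
  }

  void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
  {
          __call_rcu(head, func, &rcu_state);
  }

  void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *))
  {
          __call_rcu(head, func, &rcu_bh_state);
  }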

Quick Quiz [*].29: 
Given that rcu_pending() is always called twice on lines 29-32 of Figure [*], shouldn't there be some way to combine the checks of the two structures?
 
Answer:
Sorry, but this was a trick question. The C language's short-circuit boolean expression evaluation means that __rcu_pending() is invoked on rcu_bh_state only if the prior invocation on rcu_state returns zero.

The reason the two calls are in this order is that ``rcu'' is used more heavily than is ``rcu_bh'', so the first call is more likely to return non-zero than is the second.
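
A sketch of the resulting short-circuit evaluation, using this chapter's structure and per-CPU variable names:

  static int rcu_pending(int cpu)
  {
          /* The second __rcu_pending() call runs only if the first returns zero. */
          return __rcu_pending(&rcu_state, &per_cpu(rcu_data, cpu)) ||
                 __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu));
  }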

Quick Quiz [*].30: 
Shouldn't line 42 of Figure [*] also check for in_hardirq()?
 
Answer:
No. The rcu_read_lock_bh() primitive disables softirq, not hardirq. Because call_rcu_bh() need only wait for pre-existing ``rcu_bh'' read-side critical sections to complete, we need only check in_softirq().

Quick Quiz [*].31: 
But don't we also need to check that a grace period is actually in progress in __rcu_process_callbacks in Figure [*]?
 
Answer:
Indeed we do! And the first thing that force_quiescent_state() does is to perform exactly that check.

Quick Quiz [*].32: 
What happens if two CPUs attempt to start a new grace period concurrently in Figure [*]?
 
Answer:
One of the CPUs will be the first to acquire the root rcu_node structure's lock, and that CPU will start the grace period. The other CPU will then acquire the lock and invoke rcu_start_gp(), which, seeing that a grace period is already in progress, will immediately release the lock and return.

Quick Quiz [*].33: 
How does the code traverse a given path through the rcu_node hierarchy from root to leaves?
 
Answer:
It turns out that the code never needs to do such a traversal, so there is nothing special in place to handle this.

Quick Quiz [*].34: 
C-preprocessor macros are so 1990s! Why not get with the times and convert RCU_DATA_PTR_INIT() in Figure [*] to be a function?
 
Answer:
Because, although it is possible to pass a reference to a particular CPU's instance of a per-CPU variable to a function, there does not appear to be a good way to pass a reference to the full set of instances of a given per-CPU variable to a function. One could of course build an array of pointers, then pass a reference to the array in, but that is part of what the RCU_DATA_PTR_INIT() macro is doing in the first place.
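
A simplified sketch of such a macro appears below; it assumes that the macro's main job is to record a pointer to each CPU's rcu_data instance in the rcu_state structure's ->rda[] array, and it omits any additional initialization the real macro might perform. Note that the per-CPU variable must be named at the macro's expansion site, which is why this cannot easily be a function:

  #define RCU_DATA_PTR_INIT(rsp, rcu_data) \
  do { \
          int __i; \
  \
          for_each_possible_cpu(__i) \
                  (rsp)->rda[__i] = &per_cpu(rcu_data, __i); \
  } while (0)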

Quick Quiz [*].35: 
What happens if a CPU comes online between the time that the last online CPU is notified on lines 25-26 of Figure [*] and the time that register_cpu_notifier() is invoked on line 27?
 
Answer:
Only one CPU is online at this point, so the only way another CPU can come online is if this CPU puts it online, which it is not doing.

Quick Quiz [*].36: 
Why call cpu_quiet() on line 41 of Figure [*], given that we are excluding grace periods with various locks, and given that any earlier grace periods would not have been waiting on this previously-offlined CPU?
 
Answer:
A new grace period might have started just after the ->onofflock was released on line 40. The cpu_quiet() will help expedite such a grace period.

Quick Quiz [*].37: 
But what if the rcu_node hierarchy has only a single structure, as it would on a small system? What prevents concurrent grace-period initialization in that case, given the code in Figure [*]?
 
Answer:
The later acquisition of the sole rcu_node structure's ->lock on line 16 excludes grace-period initialization, which must acquire this same lock in order to initialize this sole rcu_node structure for the new grace period.

The ->onofflock is needed only for multi-node hierarchies, and is used in that case as an alternative to acquiring and holding all of the rcu_node structures' ->lock fields, which would be incredibly painful on large systems.

Quick Quiz [*].38: 
But does line 25 of Figure [*] ever really exit the loop? Why or why not?
 
Answer:
The only way that line 25 could exit the loop is if all CPUs were to be put offline. This cannot happen in the Linux kernel as of 2.6.28, though other environments have been designed to offline all CPUs during the normal shutdown procedure.

Quick Quiz [*].39: 
Suppose that line 26 got executed seriously out of order in Figure [*], so that lastcomp is set to some prior grace period, but so that the current grace period is still waiting on the now-offline CPU? In this case, won't the call to cpu_quiet() fail to report the quiescent state, thus causing the grace period to wait forever for this now-offline CPU?
 
Answer:
First, the lock acquisitions on lines 16 and 12 would prevent the execution of line 26 from being pushed that far out of order. Nevertheless, even if line 26 managed to be misordered that dramatically, what would happen is that force_quiescent_state() would eventually be invoked, and would notice that the current grace period was waiting for a quiescent state from an offline CPU. Then force_quiescent_state() would report the extended quiescent state on behalf of the offlined CPU.

Quick Quiz [*].40: 
Given that an offline CPU is in an extended quiescent state, why does line 28 of Figure [*] need to care which grace period it is dealing with?
 
Answer:
It really does not need to care in this case. However, because it does need to care in many other cases, the cpu_quiet() function does take the grace-period number as an argument, so some value must be supplied.

Quick Quiz [*].41: 
But this list movement in Figure [*] makes all of the going-offline CPU's callbacks go through another grace period, even if they were ready to invoke. Isn't that inefficient? Furthermore, couldn't an unfortunate pattern of CPUs going offline then coming back online prevent a given callback from ever being invoked?
 
Answer:
It is inefficient, but it is simple. Given that this is not a commonly executed code path, this is the right tradeoff. The starvation case would be a concern, except that the online and offline process involves multiple grace periods.

Quick Quiz [*].42: 
Why not just expand note_new_gpnum() inline into check_for_new_grace_period() in Figure [*]?
 
Answer:
Because note_new_gpnum() must be called for each new grace period, including both those started by this CPU and those started by other CPUs. In contrast, check_for_new_grace_period() is called only for the case where some other CPU started the grace period.

Quick Quiz [*].43: 
But there has been no initialization yet at line 15 of Figure [*]! What happens if a CPU notices the new grace period and immediately attempts to report a quiescent state? Won't it get confused?
 
Answer:
There are two cases of interest.

In the first case, there is only a single rcu_node structure in the hierarchy. Since the CPU executing in rcu_start_gp() is currently holding that rcu_node structure's lock, the CPU attempting to report the quiescent state will not be able to acquire this lock until initialization is complete, at which point the quiescent state will be reported normally.

In the second case, there are multiple rcu_node structures, and the leaf rcu_node structure corresponding to the CPU that is attempting to report the quiescent state already has that CPU's ->qsmask bit cleared. Therefore, the CPU attempting to report the quiescent state will give up, and some later quiescent state for that CPU will be applied to the new grace period.

Quick Quiz [*].44: 
Hey! Shouldn't we hold the non-leaf rcu_node structures' locks when munging their state in line 37 of Figure [*]???
 
Answer:
There is no need to hold their locks. The reasoning is as follows:

  1. The new grace period cannot end, because the running CPU (which is initializing it) won't pass through a quiescent state. Therefore, there is no race with another invocation of rcu_start_gp().
  2. The running CPU holds ->onofflock, so there is no race with CPU-hotplug operations.
  3. The leaf rcu_node structures are not yet initialized, so they have all of their ->qsmask bits cleared. This means that any other CPU attempting to report a quiescent state will stop at the leaf level, and thus cannot race with the current CPU for non-leaf rcu_node structures.
  4. The RCU tracing functions access, but do not modify, the rcu_node structures' fields. Races with these functions are therefore harmless.

Quick Quiz [*].45: 
Why can't we merge the loop spanning lines 36-37 with the loop spanning lines 40-44 in Figure [*]?
 
Answer:
If we were to do so, we would either be needlessly acquiring locks for the non-leaf rcu_node structures or would need ugly checks for a given node being a leaf node on each pass through the loop. (Recall that we must acquire the locks for the leaf rcu_node structures due to races with CPUs attempting to report quiescent states.)

Nevertheless, it is quite possible that experience on very large systems will show that such merging is in fact the right thing to do.

Quick Quiz [*].46: 
What prevents lines 11-12 of Figure [*] from reporting a quiescent state from a prior grace period against the current grace period?
 
Answer:
If this could occur, it would be a serious bug, since the CPU in question might be in an RCU read-side critical section that started before the beginning of the current grace period.

There are several cases to consider for the CPU in question:

  1. It remained online and active throughout.
  2. It was in dynticks-idle mode for at least part of the current grace period.
  3. It was offline for at least part of the current grace period.

In the first case, the prior grace period could not have ended without this CPU explicitly reporting a quiescent state, which would leave ->qs_pending zero. This in turn would mean that lines 7-8 would return, so that control would not reach cpu_quiet() unless check_for_new_grace_period() had noted the new grace period. However, if the current grace period had been noted, it would also have set ->passed_quiesc to zero, in which case lines 9-10 would have returned, again meaning that cpu_quiet() would not be invoked. Finally, the only way that ->passed_quiesc could have been set would be if rcu_check_callbacks() were invoked by a scheduling-clock interrupt that occurred somewhere between lines 5 and 9 of rcu_check_quiescent_state() in Figure [*]. However, this would be a case of a quiescent state occurring in the current grace period, which would be totally legitimate to report against the current grace period. So this case is correctly covered.

In the second case, where the CPU in question spent part of the new quiescent state in dynticks-idle mode, note that dynticks-idle mode is an extended quiescent state, hence it is again permissible to report this quiescent state against the current grace period.

In the third case, where the CPU in question spent part of the new quiescent state offline, note that offline CPUs are in an extended quiescent state, which is again permissible to report against the current grace period.

So quiescent states from prior grace periods are never reported against the current grace period.

Quick Quiz [*].47: 
How do lines 22-23 of Figure [*] know that it is safe to promote the running CPU's RCU callbacks?
 
Answer:
Because the specified CPU has not yet passed through a quiescent state, and because we hold the corresponding leaf node's lock, we know that the current grace period cannot possibly have ended yet. Therefore, there is no danger that any of the callbacks currently queued were registered after the next grace period started, given that they have already been queued and the next grace period has not yet started.

Quick Quiz [*].48: 
Given that argument mask on line 2 of Figure [*] is an unsigned long, how can it possibly deal with systems with more than 64 CPUs?
 
Answer:
Because mask is specific to the specified leaf rcu_node structure, it need only be large enough to represent the CPUs corresponding to that particular rcu_node structure. Since at most 64 CPUs may be associated with a given rcu_node structure (32 CPUs on 32-bit systems), the unsigned long mask argument suffices.

Quick Quiz [*].49: 
How do RCU callbacks on dynticks-idle or offline CPUs get invoked?
 
Answer:
They don't. CPUs with RCU callbacks are not permitted to enter dynticks-idle mode, so dynticks-idle CPUs never have RCU callbacks. When CPUs go offline, their RCU callbacks are migrated to an online CPU, so offline CPUs never have RCU callbacks, either. Thus, there is no need to invoke callbacks on dynticks-idle or offline CPUs.

Quick Quiz [*].50: 
Why would lines 14-17 in Figure [*] need to adjust the tail pointers?
 
Answer:
If any of the tail pointers reference the last callback in the sublist that was ready to invoke, they must be changed to instead reference the ->nxtlist pointer. This situation occurs when the sublists immediately following the ready-to-invoke sublist are empty.

Quick Quiz [*].51: 
But how does the code in Figure [*] handle nested NMIs?
 
Answer:
It does not have to handle nested NMIs, because NMIs do not nest.

Quick Quiz [*].52: 
Why isn't there a memory barrier between lines 8 and 9 of Figure [*]? Couldn't this cause the code to fetch even-numbered values from both the ->dynticks and ->dynticks_nmi fields, even though these two fields never were zero at the same time?
 
Answer:
First, review the code in Figures [*], [*], and [*], and note that dynticks and dynticks_nmi will never have odd values simultaneously (see especially lines 6 and 17 of Figure [*], and recall that interrupts cannot happen from NMIs).

Of course, given the placement of the memory barriers in these functions, it might appear to another CPU that both counters were odd at the same time, but logically this cannot happen, and would indicate that the CPU had in fact passed through dynticks-idle mode.

Now, let's suppose that at the time line 8 fetches ->dynticks, the value of ->dynticks_nmi was an odd number, and that at the time line 9 fetches ->dynticks_nmi, the value of ->dynticks was an odd number. Given that both counters cannot be odd simultaneously, there must have been a time between these two fetches when both counters were even, and thus a time when the CPU was in dynticks-idle mode, which is a quiescent state, as required.

So, why can't the && on line 13 of Figure [*] be replaced with an ==? Well, it could be, but this would likely be more confusing than helpful.

Quick Quiz [*].53: 
Why wait the extra couple jiffies on lines 12-13 in Figure [*]?
 
Answer:
This added delay gives the offending CPU a better chance of reporting on itself, thus getting a decent stack trace of the stalled code. Of course, if the offending CPU is spinning with interrupts disabled, it will never report on itself, so other CPUs do so after a short delay.

Quick Quiz [*].54: 
What prevents the grace period from ending before the stall warning is printed in Figure [*]?
 
Answer:
The caller checked that this CPU still had not reported a quiescent state, and because preemption is disabled, there is no way that a quiescent state could have been reported in the meantime.

Quick Quiz [*].55: 
Why does print_other_cpu_stall() in Figure [*] need to check for the grace period ending when print_cpu_stall() did not?
 
Answer:
The other CPUs might pass through a quiescent state at any time, so the grace period might well have ended in the meantime.

Quick Quiz [*].56: 
Why is it important that blocking primitives called from within a preemptible-RCU read-side critical section be subject to priority inheritance?
 
Answer:
Because blocked readers stall RCU grace periods, which can result in OOM. For example, if a reader did a wait_event() within an RCU read-side critical section, and that event never occurred, then RCU grace periods would stall indefinitely, guaranteeing that the system would OOM sooner or later. There must therefore be some way to cause these readers to progress through their read-side critical sections in order to avoid such OOMs. Priority boosting is one way to force such progress, but only if readers are restricted to blocking such that they can be awakened via priority boosting.

Of course, there are other methods besides priority inheritance that handle the priority inversion problem, including priority ceiling, preemption disabling, and so on. However, there are good reasons why priority inheritance is the approach used in the Linux kernel, so this is what is used for RCU.

Quick Quiz [*].57: 
Could the prohibition against using primitives that would block in a non-CONFIG_PREEMPT kernel be lifted, and if so, under what conditions?
 
Answer:
If testing and benchmarking demonstrated that the preemptible RCU worked well enough that classic RCU could be dispensed with entirely, and if priority inheritance was implemented for blocking synchronization primitives such as semaphores, then those primitives could be used in RCU read-side critical sections.

Quick Quiz [*].58: 
How is it possible for lines 38-43 of __rcu_advance_callbacks() to be executed when lines 7-37 have not? Won't they both be executed just after a counter flip, and never at any other time?
 
Answer:
Consider the following sequence of events:

  1. CPU 0 executes lines 5-12 of rcu_try_flip_idle().
  2. CPU 1 executes __rcu_advance_callbacks(). Because rcu_ctrlblk.completed has been incremented, lines 7-37 execute. However, none of the rcu_flip_flag variables have been set, so lines 38-43 do not execute.
  3. CPU 0 executes lines 13-15 of rcu_try_flip_idle().
  4. Later, CPU 1 again executes __rcu_advance_callbacks(). The counter has not been incremented since the earlier execution, but the rcu_flip_flag variables have all been set, so only lines 38-43 are executed.

Quick Quiz [*].59: 
What problems could arise if the lines containing ACCESS_ONCE() in rcu_read_unlock() were reordered by the compiler?
 
Answer:

  1. If the ACCESS_ONCE() were omitted from the fetch of rcu_flipctr_idx (line 14), then the compiler would be within its rights to eliminate idx. It would also be free to compile the rcu_flipctr decrement as a fetch-increment-store sequence, separately fetching rcu_flipctr_idx for both the fetch and the store. If an NMI were to occur between the fetch and the store, and if the NMI handler contained an rcu_read_lock(), then the value of rcu_flipctr_idx would change in the meantime, resulting in corruption of the rcu_flipctr values, destroying the ability to correctly identify grace periods.
  2. Another failure that could result from omitting the ACCESS_ONCE() from line 14 is due to the compiler reordering this statement to follow the decrement of rcu_read_lock_nesting (line 16). In this case, if an NMI were to occur between these two statements, then any rcu_read_lock() in the NMI handler could corrupt rcu_flipctr_idx, causing the wrong rcu_flipctr to be decremented. As with the analogous situation in rcu_read_lock(), this could result in premature grace-period termination, an indefinite grace period, or even both.
  3. If ACCESS_ONCE() macros were omitted such that the update of rcu_read_lock_nesting could be interchanged by the compiler with the decrement of rcu_flipctr, and if an NMI occurred in between, any rcu_read_lock() in the NMI handler would incorrectly conclude that it was protected by an enclosing rcu_read_lock(), and fail to increment the rcu_flipctr variables.

It is not clear that the ACCESS_ONCE() on the fetch of rcu_read_lock_nesting (line 7) is required.

Quick Quiz [*].60: 
What problems could arise if the lines containing ACCESS_ONCE() in rcu_read_unlock() were reordered by the CPU?
 
Answer:
Absolutely none! The code in rcu_read_unlock() interacts with the scheduling-clock interrupt handler running on the same CPU, and is thus insensitive to reorderings because CPUs always see their own accesses as if they occurred in program order. Other CPUs do access the rcu_flipctr, but because these other CPUs don't access any of the other variables, ordering is irrelevant.

Quick Quiz [*].61: 
What problems could arise in rcu_read_unlock() if irqs were not disabled?
 
Answer:

  1. Disabling irqs has the side effect of disabling preemption. Suppose that this code were to be preempted in the midst of line 17 between selecting the current CPU's copy of the rcu_flipctr array and the decrement of the element indicated by rcu_flipctr_idx. Execution might well resume on some other CPU. If this resumption happened concurrently with an rcu_read_lock() or rcu_read_unlock() running on the original CPU, an increment or decrement might be lost, resulting in either premature termination of a grace period, indefinite extension of a grace period, or even both.
  2. Failing to disable preemption can also defeat RCU priority boosting, which relies on rcu_read_lock_nesting to determine which tasks to boost. If preemption occurred between the update of rcu_read_lock_nesting (line 16) and of rcu_flipctr (line 17), then a grace period might be stalled until this task resumed. But because the RCU priority booster has no way of knowing that this particular task is stalling grace periods, needed boosting will never occur. Therefore, if there are CPU-bound realtime tasks running, the preempted task might never resume, stalling grace periods indefinitely, and eventually resulting in OOM.

Of course, both of these situations could be handled by disabling preemption rather than disabling irqs. (The CPUs I have access to do not show much difference between these two alternatives, but others might.)

Quick Quiz [*].62: 
Suppose that the irq disabling in rcu_read_lock() was replaced by preemption disabling. What effect would that have on GP_STAGES?
 
Answer:
No finite value of GP_STAGES suffices. The following scenario, courtesy of Oleg Nesterov, demonstrates this:

Suppose that low-priority Task A has executed rcu_read_lock() on CPU 0, and thus has incremented per_cpu(rcu_flipctr, 0)[0], which thus has a value of one. Suppose further that Task A is now preempted indefinitely.

Given this situation, consider the following sequence of events:

  1. Task B starts executing rcu_read_lock(), also on CPU 0, picking up the low-order bit of rcu_ctrlblk.completed, which is still equal to zero.
  2. Task B is interrupted by a sufficient number of scheduling-clock interrupts to allow the current grace-period stage to complete, and also by interrupts long-running enough to allow the RCU grace-period state machine to advance the rcu_ctrlblk.completed counter so that its bottom bit is now equal to one and all CPUs have acknowledged this increment operation.
  3. CPU 1 starts summing the index==0 counters, starting with per_cpu(rcu_flipctr, 0)[0], which is equal to one due to Task A's increment. CPU 1's local variable sum is therefore equal to one.
  4. Task B returns from interrupt, resuming its execution of rcu_read_lock(), incrementing per_cpu(rcu_flipctr, 0)[0], which now has a value of two.
  5. Task B is migrated to CPU 2.
  6. Task B completes its RCU read-side critical section, and executes rcu_read_unlock(), which decrements per_cpu(rcu_flipctr, 2)[0], which is now -1.
  7. CPU 1 now adds per_cpu(rcu_flipctr, 1)[0] and per_cpu(rcu_flipctr, 2)[0] to its local variable sum, obtaining the value zero.
  8. CPU 1 then incorrectly concludes that all prior RCU read-side critical sections have completed, and advances to the next RCU grace-period stage. This means that some other task might well free up data structures that Task A is still using!

This sequence of events could repeat indefinitely, so that no finite value of GP_STAGES could prevent disrupting Task A. This sequence of events demonstrates the importance of the promise made by CPUs that acknowledge an increment of rcu_ctrlblk.completed, as the problem illustrated by the above sequence of events is caused by Task B's repeated failure to honor this promise.

Therefore, more-pervasive changes to the grace-period state will be required in order for rcu_read_lock() to be able to safely dispense with irq disabling.

Quick Quiz [*].63: 
Why can't the rcu_dereference() precede the memory barrier?
 
Answer:
Because the memory barrier is being executed in an interrupt handler, and interrupts are exact in the sense that a single value of the PC is saved upon interrupt, so that the interrupt occurs at a definite place in the code. Therefore, if the rcu_dereference() were to precede the memory barrier, the interrupt would have had to have occurred after the rcu_dereference(), and therefore the interrupt would also have had to have occurred after the rcu_read_lock() that begins the RCU read-side critical section. This would have forced the rcu_read_lock() to use the earlier value of the grace-period counter, which would in turn have meant that the corresponding rcu_read_unlock() would have had to precede the first "Old counters zero [0]" rather than the second one. This in turn would have meant that the read-side critical section would have been much shorter -- which would have been counter-productive, given that the point of this exercise was to identify the longest possible RCU read-side critical section.

Quick Quiz [*].64: 
What is a more precise way to say "CPU 0 might see CPU 1's increment as early as CPU 1's last previous memory barrier"?
 
Answer:
First, it is important to note that the problem with the less-precise statement is that it gives the impression that there might be a single global timeline, which there is not, at least not for popular microprocessors. Second, it is important to note that memory barriers are all about perceived ordering, not about time. Finally, a more precise way of stating the above would be as follows: "If CPU 0 loads the value resulting from CPU 1's increment, then any subsequent load by CPU 0 will see the values from any relevant stores by CPU 1 if those stores preceded CPU 1's last prior memory barrier."

Even this more-precise version leaves some wiggle room. The word "subsequent" must be understood to mean "ordered after", either by an explicit memory barrier or by the CPU's underlying memory ordering. In addition, the memory barriers must be strong enough to order the relevant operations. For example, CPU 1's last prior memory barrier must order stores (for example, smp_wmb() or smp_mb()). Similarly, if CPU 0 needs an explicit memory barrier to ensure that its later load follows the one that saw the increment, then this memory barrier needs to be an smp_rmb() or smp_mb().
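
The following two-CPU sketch makes this concrete; x and y are hypothetical shared variables, both initially zero, and the store to y stands in for the increment:

  /* CPU 1 */
  x = 1;                          /* relevant store */
  smp_wmb();                      /* CPU 1's last prior memory barrier */
  ACCESS_ONCE(y) = 1;             /* the ``increment'' that CPU 0 loads */

  /* CPU 0 */
  r1 = ACCESS_ONCE(y);            /* suppose this load returns 1 */
  smp_rmb();                      /* makes the next load ``subsequent'' */
  r2 = x;                         /* then this load is guaranteed to return 1 */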

In general, much care is required when proving parallel algorithms.
