Quick Quiz .2:
Why not permit sleeping in Classic RCU read-side critical sections
by eliminating context switch as a quiescent state, leaving user-mode
execution and the idle loop as the remaining quiescent states?
Answer:
This would mean that a system undergoing heavy kernel-mode
execution load (e.g., due to kernel threads) might never
complete a grace period, which
would cause it to exhaust memory sooner or later.
Quick Quiz .3:
Why is it OK to assume that updates separated by
synchronize_sched() will be performed in order?
Answer:
Because this property is required for the synchronize_sched()
aspect of RCU to work at all.
For example, consider a code sequence that removes an object
from a list, invokes synchronize_sched(), then frees
the object.
If this property did not hold, then that object might appear
to be freed before it was
removed from the list, which is precisely the situation that
synchronize_sched() is supposed to prevent!
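As a concrete illustration, consider the following minimal kernel-style
sketch of this remove-then-free pattern (the foo structure, foo_list,
foo_lock, and foo_del() are hypothetical names invented for this
example):

  #include <linux/list.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>

  struct foo {
          struct list_head list;
          int data;
  };

  static LIST_HEAD(foo_list);
  static DEFINE_SPINLOCK(foo_lock);

  /* Remove fp from the list, wait for pre-existing readers, then free. */
  static void foo_del(struct foo *fp)
  {
          spin_lock(&foo_lock);
          list_del_rcu(&fp->list);  /* step 1: unlink from the list */
          spin_unlock(&foo_lock);
          synchronize_sched();      /* step 2: wait for pre-existing readers */
          kfree(fp);                /* step 3: no reader can hold a reference */
  }

If updates separated by synchronize_sched() could be perceived out of
order, the effects of the kfree() could become visible before those of
the list_del_rcu(), which is exactly the failure mode described above.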
Quick Quiz .4:
Why must line 17 in synchronize_srcu()
(Figure )
precede the release of the mutex on line 18?
What would have to change to permit these two lines to be
interchanged?
Would such a change be worthwhile?
Why or why not?
Answer:
Suppose that the order was reversed, and that CPU 0
has just reached line 13 of
synchronize_srcu(), while both CPU 1 and CPU 2 start executing
another synchronize_srcu() each, and CPU 3 starts executing a
srcu_read_lock().
Suppose that CPU 1 reaches line 6 of synchronize_srcu()
just before CPU 0 increments the counter on line 13.
Most importantly, suppose that
CPU 3 executes srcu_read_lock()
out of order with the following SRCU read-side critical section,
so that it acquires a reference to some SRCU-protected data
structure before CPU 0 increments sp->completed, but
executes the srcu_read_lock() after CPU 0 does
this increment.
Then CPU 0 will not wait for CPU 3 to complete its SRCU read-side critical section before exiting the ``while'' loop on lines 15-16 and releasing the mutex (remember, the CPU could be reordering the code).
Now suppose that CPU 2 acquires the mutex next, and again increments sp->completed. This CPU will then have to wait for CPU 3 to exit its SRCU read-side critical section before exiting the loop on lines 15-16 and releasing the mutex. But suppose that CPU 3 again executes out of order, completing the srcu_read_unlock() prior to executing a final reference to the pointer it obtained when entering the SRCU read-side critical section.
CPU 1 will then acquire the mutex, but see that the sp->completed counter has incremented twice, and therefore take the early exit. The caller might well free up the element that CPU 3 is still referencing (due to CPU 3's out-of-order execution).
To prevent this perhaps improbable, but entirely possible, scenario, the final synchronize_sched() must precede the mutex release in synchronize_srcu().
Another approach would be to change the comparison on line 7 of synchronize_srcu() to check for at least three increments of the counter. However, such a change would increase the latency of a ``bulk update'' scenario, where a hash table is being updated or unloaded using multiple threads. In the current code, the resulting concurrent synchronize_srcu() calls would take at most two SRCU grace periods, while with this change, three would be required.
More experience will be required to determine which approach is really better. For one thing, there must first be some use of SRCU with multiple concurrent updaters.
Quick Quiz .5:
Wait a minute!
With all those new locks, how do you avoid deadlock?
Answer:
Deadlock is avoided by never holding more than one of the
rcu_node structures' locks at a given time.
This algorithm uses two more locks, one to prevent CPU hotplug
operations from running concurrently with grace-period advancement
(onofflock) and another
to permit only one CPU at a time to force a quiescent state
to end quickly (fqslock).
These are subject to a locking hierarchy, so that
fqslock must be acquired before
onofflock, which in turn must be acquired before
any of the rcu_node structures' locks.
Also, as a practical matter, refusing to ever hold more than one of the rcu_node locks means that it is unnecessary to track which ones are held. Such tracking would be painful as well as unnecessary.
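This ordering can be sketched as follows (the function and the
rcu_node_sketch structure are purely illustrative; the lock names
follow the discussion above):

  #include <linux/spinlock.h>

  struct rcu_node_sketch {
          spinlock_t lock;
          /* ... bitmasks and other grace-period state ... */
  };

  static DEFINE_SPINLOCK(fqslock);   /* serializes force_quiescent_state() */
  static DEFINE_SPINLOCK(onofflock); /* CPU hotplug vs. grace-period init */

  /* Any path needing more than one of these locks acquires them
   * in this order, so that deadlock cannot occur. */
  static void locking_order_sketch(struct rcu_node_sketch *rnp)
  {
          spin_lock(&fqslock);
          spin_lock(&onofflock);
          spin_lock(&rnp->lock);     /* at most one rcu_node lock at a time */
          /* ... grace-period bookkeeping ... */
          spin_unlock(&rnp->lock);
          spin_unlock(&onofflock);
          spin_unlock(&fqslock);
  }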
Quick Quiz .6:
Why stop at a 64-times reduction?
Why not go for a few orders of magnitude instead?
Answer:
RCU works with no problems on
systems with a few hundred CPUs, so allowing 64 CPUs to contend on
a single lock leaves plenty of headroom.
Keep in mind that these locks are acquired quite rarely, as each
CPU will check in about once per grace period, and grace periods
extend for milliseconds.
Quick Quiz .7:
But I don't care about McKenney's lame excuses in the answer to
Quick Quiz 2!!!
I want to get the number of CPUs contending on a single lock down
to something reasonable, like sixteen or so!!!
Answer:
OK, have it your way, then!
Set CONFIG_RCU_FANOUT=16 and (for NR_CPUS=4096)
you will get a
three-level hierarchy with 256 rcu_node structures
at the lowest level, 16 rcu_node structures as intermediate
nodes, and a single root-level rcu_node structure.
The penalty you will pay is that more rcu_node structures
will need to be scanned when checking to see which CPUs need help
completing their quiescent states (256 instead of only 64).
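The shape of the hierarchy can be checked with a short illustrative
userspace computation (not kernel code):

  #include <stdio.h>

  /* Print the number of rcu_node structures at each level for
   * NR_CPUS=4096 and CONFIG_RCU_FANOUT=16, leaves first. */
  int main(void)
  {
          int fanout = 16;
          int nodes = (4096 + fanout - 1) / fanout;
          int level = 0;

          for (;;) {
                  printf("level %d: %d rcu_node structure(s)\n",
                         level++, nodes);
                  if (nodes == 1)
                          break;
                  nodes = (nodes + fanout - 1) / fanout;
          }
          return 0;
  }

This prints 256 structures at the leaf level, 16 at the intermediate
level, and a single root, matching the description above.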
Quick Quiz .8:
OK, so what is the story with the colors?
Answer:
Data structures analogous to rcu_state (including
rcu_ctrlblk) are yellow,
those containing the bitmaps used to determine when CPUs have checked
in are pink,
and the per-CPU rcu_data structures are blue.
The data structures used to conserve energy
(such as rcu_dynticks) will be colored green.
Quick Quiz .9:
Given such an egregious bug, why does Linux run at all?
Answer:
Because the Linux kernel contains device drivers that are (relatively)
well behaved.
Few if any of them spin in RCU read-side critical sections for the
many milliseconds that would be required to provoke this bug.
The bug nevertheless does need to be fixed, and this variant of
RCU does fix it.
Quick Quiz .10:
But doesn't this state diagram indicate that dyntick-idle CPUs will
get hit with reschedule IPIs? Won't that wake them up?
Answer:
No.
Keep in mind that RCU is handling groups of CPUs.
One particular group might contain both dyntick-idle CPUs and
CPUs in normal mode that have somehow managed to avoid passing through
a quiescent state.
Only the latter group will be sent a reschedule IPI; the dyntick-idle
CPUs will merely be marked as being in an extended quiescent state.
Quick Quiz .11:
But what happens if a CPU tries to report going through a quiescent
state (by clearing its bit) before the bit-setting CPU has finished?
Answer:
There are three cases to consider here:
So, in all three cases, the potential race is resolved correctly.
Quick Quiz .12:
And what happens if all CPUs try to report going
through a quiescent
state before the bit-setting CPU has finished, thus ending the new
grace period before it starts?
Answer:
The bit-setting CPU cannot pass through a
quiescent state during initialization, as it has irqs disabled.
Its bits therefore remain non-zero, preventing the grace period from
ending until the data structure has been fully initialized.
Quick Quiz .13:
And what happens if one CPU comes out of dyntick-idle mode and then
passes through a quiescent state just as another CPU notices that the
first CPU was in dyntick-idle mode?
Couldn't they both attempt to report a quiescent state at the same
time, resulting in confusion?
Answer:
They will both attempt to acquire the lock on the same leaf
rcu_node structure.
The first one to acquire the lock will report the quiescent state
and clear the appropriate bit, and the second one to acquire the
lock will see that this bit has already been cleared.
Quick Quiz .14:
But what if all the CPUs end up in dyntick-idle mode?
Wouldn't that prevent the current RCU grace period from ever ending?
Answer:
Indeed it will!
However, CPUs that have RCU callbacks are not permitted to enter
dyntick-idle mode, so the only way that all the CPUs could
possibly end up in dyntick-idle mode would be if there were
absolutely no RCU callbacks in the system.
And if there are no RCU callbacks in the system, then there is no
need for the RCU grace period to end.
In fact, there is no need for the RCU grace period to even
start.
RCU will restart if some irq handler does a call_rcu(), which will cause an RCU callback to appear on the corresponding CPU, which will force that CPU out of dyntick-idle mode, which will in turn permit the current RCU grace period to come to an end.
Quick Quiz .15:
Given that force_quiescent_state() is a three-phase state
machine, don't we have triple the scheduling latency due to scanning
all the CPUs?
Answer:
Ah, but the three phases will not execute back-to-back on the same CPU,
and, furthermore, the first (initialization) phase doesn't do any
scanning.
Therefore, the scheduling-latency hit of the three-phase algorithm
is no different than that of a single-phase algorithm.
If the scheduling latency becomes a problem, one approach would be to
recode the state machine to scan the CPUs incrementally, most likely
by keeping state on a per-leaf-rcu_node basis.
But first show me a problem in the real world, then
I will consider fixing it!
Quick Quiz .16:
But the other reason to hold ->onofflock is to prevent
multiple concurrent online/offline operations, right?
Answer:
Actually, no!
The CPU-hotplug code's synchronization design prevents multiple
concurrent CPU online/offline operations, so only one CPU
online/offline operation can be executing at any given time.
Therefore, the only purpose of ->onofflock is to prevent a CPU
online or offline operation from running concurrently with grace-period
initialization.
Quick Quiz .17:
Given all these acquisitions of the global ->onofflock,
won't there
be horrible lock contention when running with thousands of CPUs?
Answer:
Actually, there can be only three acquisitions of this lock per grace
period, and each grace period lasts many milliseconds.
One of the acquisitions is by the CPU initializing for the current
grace period, and the other two are made when onlining and offlining
some CPU.
These latter two cannot run concurrently due to the CPU-hotplug
locking, so at most two CPUs can be contending for this lock at any
given time.
Lock contention on ->onofflock should therefore be no problem, even on systems with thousands of CPUs.
Quick Quiz .18:
Why not simplify the code by merging the detection of dyntick-idle
CPUs with that of offline CPUs?
Answer:
Such merging may well eventually be the right
thing to do.
In the meantime, however, there are some challenges:
Quick Quiz .19:
Why not simply disable bottom halves (softirq) when acquiring
the rcu_data structure's lock?
Wouldn't this be faster?
Answer:
Because this lock can be acquired from functions
called by call_rcu(), which in turn can be
invoked from irq handlers.
Therefore, irqs must be disabled when
holding this lock.
Quick Quiz .20:
How about the qsmask and qsmaskinit
fields for the leaf rcu_node structures?
Doesn't there have to be some way to work out
which of the bits in these fields corresponds
to each CPU covered by the rcu_node structure
in question?
Answer:
Indeed there does!
The grpmask field in each CPU's rcu_data
structure does this job.
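For example, the bit in question could be computed along the following
lines (an illustrative sketch, where ->grplo denotes the lowest-numbered
CPU covered by the leaf rcu_node structure in question):

  struct rcu_node_sketch {
          unsigned long qsmask;  /* CPUs still needing to check in */
          int grplo;             /* lowest-numbered CPU covered */
  };

  /* The qsmask bit corresponding to the specified CPU. */
  static unsigned long cpu_grpmask(struct rcu_node_sketch *rnp, int cpu)
  {
          return 1UL << (cpu - rnp->grplo);
  }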
Quick Quiz .21:
But why bother setting qs_pending to one when a CPU
is coming online, given that being offline is an extended
quiescent state that should cover any ongoing grace period?
Answer:
Because this helps to resolve a race that occurs when a CPU comes
online just as a new grace period is starting.
Quick Quiz .22:
Why record the last completed grace period number in
passed_quiesc_completed?
Doesn't that cause this RCU implementation to be vulnerable
to quiescent states seen while no grace period was in progress
being incorrectly applied to the next grace period that starts?
Answer:
We record the last completed grace period number in order
to avoid races where a quiescent state noted near the end of
one grace period is incorrectly applied to the next grace
period, especially for dyntick and CPU-offline grace periods.
Therefore, force_quiescent_state() and friends all
check the last completed grace period number to avoid such races.
Dyntick-idle and CPU-offline quiescent states are checked for only when a grace period is actually active. The only quiescent states that can be recorded when no grace period is in progress are self-detected quiescent states, which are recorded in the passed_quiesc_completed, passed_quiesc, and qs_pending fields. These fields are initialized every time the corresponding CPU notices that a new grace period has started, preventing any obsolete quiescent states from being applied to the new grace period.
All that said, optimizing grace-period latency may require that gpnum be tracked in addition to completed.
Quick Quiz .23:
What is the point of running a system with NR_CPUS
way bigger than the actual number of CPUs?
Answer:
Because this allows producing a single binary of the Linux kernel
that runs on a wide variety of systems, greatly easing administration
and validation.
Quick Quiz .24:
Why not simply have multiple lists rather than this funny
multi-tailed list?
Answer:
Because this multi-tailed approach, due to Lai Jiangshan,
simplifies callback processing.
Quick Quiz .25:
So some poor CPU has to note quiescent states on behalf of
each and every offline CPU?
Yecch!
Won't that result in excessive overheads in the not-uncommon
case of a system with a small number of CPUs but a large value
for NR_CPUS?
Answer:
Actually, no it will not!
Offline CPUs are excluded from both the qsmask and qsmaskinit bit masks, so RCU normally ignores them. However, there are races with online/offline operations that can result in an offline CPU having its qsmask bit set. These races must of course be handled correctly, and the way they are handled is to permit other CPUs to note that RCU is waiting on a quiescent state from an offline CPU.
Quick Quiz .26:
So what guards the earlier fields in this structure?
Answer:
Nothing does, as they are constants set at compile time
or boot time.
Of course, the fields internal to each rcu_node
in the ->node array may change, but they are
guarded separately.
Quick Quiz .27:
I thought that RCU read-side processing was supposed to
be fast!
The functions shown in
Figure
have so much junk in them that they just have to be slow!
What gives here?
Answer:
Appearances can be deceiving.
The preempt_disable(), preempt_enable(),
local_bh_disable(), and local_bh_enable() functions each
do a single non-atomic manipulation of local data.
Even that assumes CONFIG_PREEMPT; otherwise,
the preempt_disable() and preempt_enable()
functions emit no code, not even compiler directives.
The __acquire() and __release() functions
emit no code (not even compiler directives), but are instead
used by the sparse semantic-parsing bug-finding program.
Finally, rcu_read_acquire() and rcu_read_release()
emit no code (not even compiler directives) unless the
``lockdep'' lock-order debugging facility is enabled, in
which case they can indeed be somewhat expensive.
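As a rough illustration of just how cheap these operations are, here is
a sketch of what the CONFIG_PREEMPT read-side primitives boil down to
once debugging is disabled (the task_struct_sketch structure and the
function names are illustrative simplifications, not the kernel's exact
code):

  struct task_struct_sketch {
          int preempt_count;  /* stand-in for the per-task counter */
  };

  /* rcu_read_lock(): preempt_disable() is a single non-atomic
   * increment; __acquire() and rcu_read_acquire() emit no code. */
  static inline void rcu_read_lock_sketch(struct task_struct_sketch *t)
  {
          t->preempt_count++;
  }

  /* rcu_read_unlock(): a single non-atomic decrement, omitting the
   * deferred-reschedule check done by the real preempt_enable(). */
  static inline void rcu_read_unlock_sketch(struct task_struct_sketch *t)
  {
          t->preempt_count--;
  }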
In short, unless you are a kernel hacker who has enabled debugging options, these functions are extremely cheap, and in some cases, absolutely free of overhead. And, in the words of a Portland-area furniture retailer, ``free is a very good price''.
Quick Quiz .28:
Why not simply use __get_cpu_var() to pick up a
reference to the
current CPU's rcu_data structure on line 13 in
Figure ?
Answer:
Because we might be called either from call_rcu()
(in which case we would need __get_cpu_var(rcu_data))
or from call_rcu_bh() (in which case we would need
__get_cpu_var(rcu_bh_data)).
Using the ->rda[] array of whichever
rcu_state structure we were passed works correctly
regardless of which API __call_rcu() was invoked from
(suggested by Lai Jiangshan [Jia08]).
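For reference, the resulting code pattern looks roughly like this
(kernel-style sketch; the real __call_rcu() does considerably more):

  /* Sketch: the rcu_data structure is found through whichever
   * rcu_state structure was passed in, so one body serves both APIs. */
  static void __call_rcu_sketch(struct rcu_head *head,
                                struct rcu_state *rsp)
  {
          struct rcu_data *rdp = rsp->rda[smp_processor_id()];

          /* ... enqueue head on rdp's callback list ... */
  }

  /* call_rcu():    __call_rcu_sketch(head, &rcu_state);    */
  /* call_rcu_bh(): __call_rcu_sketch(head, &rcu_bh_state); */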
Quick Quiz .29:
Given that rcu_pending() is always called twice
on lines 29-32 of
Figure ,
shouldn't there be some way to combine the checks of the
two structures?
Answer:
Sorry, but this was a trick question.
The C language's short-circuit boolean expression evaluation
means that __rcu_pending() is invoked on
rcu_bh_state only if the prior invocation on
rcu_state returns zero.
The reason the two calls are in this order is that ``rcu'' is used more heavily than is ``rcu_bh'', so the first call is more likely to return non-zero than is the second.
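That is, the combined check has roughly the following form (kernel-style
sketch matching the structure described above):

  /* The || short-circuits: the "rcu_bh" check runs only when the
   * "rcu" check returns zero. */
  static int rcu_pending_sketch(int cpu)
  {
          return __rcu_pending(&rcu_state, &per_cpu(rcu_data, cpu)) ||
                 __rcu_pending(&rcu_bh_state, &per_cpu(rcu_bh_data, cpu));
  }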
Quick Quiz .30:
Shouldn't line 42 of
Figure
also check for in_hardirq()?
Answer:
No.
The rcu_read_lock_bh() primitive disables
softirq, not hardirq.
Because call_rcu_bh() need only wait for pre-existing
``rcu_bh'' read-side critical sections to complete,
we need only check in_softirq().
Quick Quiz .31:
But don't we also need to check that a grace period is
actually in progress in __rcu_process_callbacks in
Figure ?
Answer:
Indeed we do!
And the first thing that force_quiescent_state() does
is to perform exactly that check.
Quick Quiz .32:
What happens if two CPUs attempt to start a new grace
period concurrently in
Figure ?
Answer:
One of the CPUs will be the first to acquire the root
rcu_node structure's lock, and that CPU will start
the grace period.
The other CPU will then acquire the lock and invoke
rcu_start_gp(), which, seeing that a grace period
is already in progress, will immediately release the
lock and return.
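A sketch of this pattern follows (illustrative; it assumes, as in this
implementation, that ->gpnum runs one ahead of ->completed while a
grace period is in progress):

  /* Only the first CPU to acquire the root rcu_node structure's
   * lock actually starts the new grace period. */
  static void rcu_start_gp_sketch(struct rcu_state *rsp,
                                  struct rcu_node *rnp_root)
  {
          unsigned long flags;

          spin_lock_irqsave(&rnp_root->lock, flags);
          if (rsp->gpnum != rsp->completed) {
                  /* Grace period already started; nothing to do. */
                  spin_unlock_irqrestore(&rnp_root->lock, flags);
                  return;
          }
          rsp->gpnum++;  /* we won the race: start the new grace period */
          /* ... initialize qsmask bits throughout the hierarchy ... */
          spin_unlock_irqrestore(&rnp_root->lock, flags);
  }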
Quick Quiz .33:
How does the code traverse a given path through
the rcu_node hierarchy from root to leaves?
Answer:
It turns out that the code never needs to do such a traversal,
so there is nothing special in place to handle this.
Quick Quiz .34:
C-preprocessor macros are so 1990s!
Why not get with the times and convert RCU_DATA_PTR_INIT()
in Figure
to be a function?
Answer:
Because, although it is possible to pass a reference to
a particular CPU's instance of a per-CPU variable to a function,
there does not appear to be a good way to pass a reference to
the full set of instances of a given per-CPU variable to
a function.
One could of course build an array of pointers, then pass a
reference to the array in, but that is part of what
the RCU_DATA_PTR_INIT() macro is doing in the first place.
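For reference, the macro does something like the following (sketch):

  /* Iterate over all possible CPUs, recording a pointer to each
   * CPU's instance of the specified per-CPU variable.  The variable's
   * name is pasted into per_cpu(), which is precisely what a function
   * receiving a pointer could not do. */
  #define RCU_DATA_PTR_INIT(rsp, rcu_data) \
  do { \
          int i; \
          \
          for_each_possible_cpu(i) \
                  (rsp)->rda[i] = &per_cpu(rcu_data, i); \
  } while (0)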
Quick Quiz .35:
What happens if a CPU comes online between the time
that the last online CPU is notified on lines 25-26 of
Figure
and the time that register_cpu_notifier() is invoked
on line 27?
Answer:
Only one CPU is online at this point, so the only way another
CPU can come online is if this CPU puts it online, which it
is not doing.
Quick Quiz .36:
Why call cpu_quiet() on line 41 of
Figure ,
given that we are excluding grace periods with various
locks, and given that any earlier grace periods would not have
been waiting on this previously-offlined CPU?
Answer:
A new grace period might have started just after the
->onofflock was released on line 40.
The cpu_quiet() will help expedite such a grace period.
Quick Quiz .37:
But what if the rcu_node hierarchy has only a single
structure, as it would on a small system?
What prevents concurrent grace-period initialization in that
case, given the code in
Figure ?
Answer:
The later acquisition of the sole rcu_node structure's
->lock on line 16 excludes grace-period initialization,
which must acquire this same lock in order to initialize this
sole rcu_node structure for the new grace period.
The ->onofflock is needed only for multi-node hierarchies, and is used in that case as an alternative to acquiring and holding all of the rcu_node structures' ->lock fields, which would be incredibly painful on large systems.
Quick Quiz .38:
But does line 25 of
Figure
ever really exit the loop?
Why or why not?
Answer:
The only way that line 25 could exit the loop is if all
CPUs were to be put offline.
This cannot happen in the Linux kernel as of 2.6.28, though
other environments have been designed to offline all CPUs
during the normal shutdown procedure.
Quick Quiz .39:
Suppose that line 26 got executed seriously out of order in
Figure ,
so that lastcomp is set to some prior grace period, but
so that the current grace period is still waiting on the
now-offline CPU.
In this case, won't the call to cpu_quiet() fail to
report the quiescent state, thus causing the grace period
to wait forever for this now-offline CPU?
Answer:
First, the lock acquisitions on lines 16 and 12 would prevent
the execution of line 26 from being pushed that far out of
order.
Nevertheless, even if line 26 managed to be misordered that
dramatically, what would happen is that force_quiescent_state()
would eventually be invoked, and would notice that the current
grace period was waiting for a quiescent state from an offline
CPU.
Then force_quiescent_state() would report the extended
quiescent state on behalf of the offlined CPU.
Quick Quiz .40:
Given that an offline CPU is in an extended quiescent state,
why does line 28 of
Figure
need to care which grace period it is
dealing with?
Answer:
It really does not need to care in this case.
However, because it does need to care in many other
cases, the cpu_quiet() function does take the
grace-period number as an argument, so some value must be
supplied.
Quick Quiz .41:
But this list movement in
Figure
makes all of the going-offline CPU's callbacks go through
another grace period, even if they were ready to invoke.
Isn't that inefficient?
Furthermore, couldn't an unfortunate pattern of CPUs going
offline then coming back online prevent a given callback from
ever being invoked?
Answer:
It is inefficient, but it is simple.
Given that this is not a commonly executed code path, this
is the right tradeoff.
The starvation case would be a concern, except that the
online and offline process involves multiple grace periods.
Quick Quiz .42:
Why not just expand note_new_gpnum() inline into
check_for_new_grace_period() in
Figure ?
Answer:
Because note_new_gpnum() must be called for each new
grace period, including both those started by this CPU and
those started by other CPUs.
In contrast, check_for_new_grace_period() is called only
for the case where some other CPU started the grace period.
Quick Quiz .43:
But there has been no initialization yet at line 15 of
Figure !
What happens if a CPU notices the new grace period and
immediately attempts to report a quiescent state?
Won't it get confused?
Answer:
There are two cases of interest.
In the first case, there is only a single rcu_node structure in the hierarchy. Since the CPU executing in rcu_start_gp() is currently holding that rcu_node structure's lock, the CPU attempting to report the quiescent state will not be able to acquire this lock until initialization is complete, at which point the quiescent state will be reported normally.
In the second case, there are multiple rcu_node structures, and the leaf rcu_node structure corresponding to the CPU that is attempting to report the quiescent state already has that CPU's ->qsmask bit cleared. Therefore, the CPU attempting to report the quiescent state will give up, and some later quiescent state for that CPU will be applied to the new grace period.
Quick Quiz .44:
Hey!
Shouldn't we hold the non-leaf rcu_node structures'
locks when munging their state in line 37 of
Figure ???
Answer:
There is no need to hold their locks.
The reasoning is as follows:
Quick Quiz .45:
Why can't we merge the loop spanning lines 36-37 with
the loop spanning lines 40-44 in
Figure ?
Answer:
If we were to do so, we would either be needlessly acquiring locks
for the non-leaf rcu_node structures or would need
ugly checks for a given node being a leaf node on each pass
through the loop.
(Recall that we must acquire the locks for the leaf
rcu_node structures due to races with CPUs attempting
to report quiescent states.)
Nevertheless, it is quite possible that experience on very large systems will show that such merging is in fact the right thing to do.
Quick Quiz .46:
What prevents lines 11-12 of
Figure
from reporting a quiescent state from a prior
grace period against the current grace period?
Answer:
If this could occur, it would be a serious bug, since the
CPU in question might be in an RCU read-side critical section
that started before the beginning of the current grace period.
There are several cases to consider for the CPU in question:
In the first case, the prior grace period could not have
ended without this CPU explicitly reporting a quiescent
state, which would leave ->qs_pending zero.
This in turn would mean that lines 7-8 would return, so
that control would not reach cpu_quiet() unless
check_for_new_grace_period() had noted the new grace
period.
However, if the current grace period had been noted, it would
also have set ->passed_quiesc to zero, in which case
lines 9-10 would have returned, again meaning that cpu_quiet()
would not be invoked.
Finally, the only way that ->passed_quiesc could have been set
would be if rcu_check_callbacks() was invoked by
a scheduling-clock interrupt that occurred somewhere between
lines 5 and 9 of rcu_check_quiescent_state() in
Figure .
However, this would be a case of a quiescent state occurring
in the current grace period, which would be totally
legitimate to report against the current grace period.
So this case is correctly covered.
In the second case, where the CPU in question spent part of the new quiescent state in dynticks-idle mode, note that dynticks-idle mode is an extended quiescent state, hence it is again permissible to report this quiescent state against the current grace period.
In the third case, where the CPU in question spent part of the new quiescent state offline, note that offline CPUs are in an extended quiescent state, which is again permissible to report against the current grace period.
So quiescent states from prior grace periods are never reported against the current grace period.
Quick Quiz .47:
How do lines 22-23 of
Figure
know that it is safe to promote the running CPU's RCU
callbacks?
Answer:
Because the specified CPU has not yet passed through a quiescent
state, and because we hold the corresponding leaf node's lock,
we know that the current grace period cannot possibly have
ended yet.
Therefore, there is no danger that any of the callbacks currently
queued were registered after the next grace period started, given
that they have already been queued and the next grace period
has not yet started.
Quick Quiz .48:
Given that argument mask on line 2 of
Figure
is an unsigned long, how can it possibly deal with systems
with more than 64 CPUs?
Answer:
Because mask is specific to the specified leaf rcu_node
structure, it need only be large enough to represent the
CPUs corresponding to that particular rcu_node structure.
Since at most 64 CPUs may be associated with a given
rcu_node structure (32 CPUs on 32-bit systems),
the unsigned long mask argument suffices.
Quick Quiz .49:
How do RCU callbacks on dynticks-idle or offline CPUs
get invoked?
Answer:
They don't.
CPUs with RCU callbacks are not permitted to enter dynticks-idle
mode, so dynticks-idle CPUs never have RCU callbacks.
When CPUs go offline, their RCU callbacks are migrated to
an online CPU, so offline CPUs never have RCU callbacks, either.
Thus, there is no need to invoke callbacks on dynticks-idle
or offline CPUs.
Quick Quiz .50:
Why would lines 14-17 in
Figure
need to adjust the tail pointers?
Answer:
If any of the tail pointers reference the last callback
in the sublist that was ready to invoke, they must be
changed to instead reference the ->nxtlist pointer.
This situation occurs when the sublists
immediately following the ready-to-invoke sublist are empty.
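The adjustment can be sketched as follows (kernel-style; rdp is the
CPU's rcu_data structure, RCU_DONE_TAIL indexes the tail of the
ready-to-invoke sublist, and handling of the detached list is elided):

  static void rcu_detach_done_cbs_sketch(struct rcu_data *rdp,
                                         struct rcu_head **listp)
  {
          int i;

          *listp = rdp->nxtlist;            /* detach ready callbacks */
          rdp->nxtlist = *rdp->nxttail[RCU_DONE_TAIL];
          *rdp->nxttail[RCU_DONE_TAIL] = NULL;
          /* Any tail pointer that referenced the end of the detached
           * sublist must now reference ->nxtlist instead. */
          for (i = RCU_NEXT_SIZE - 1; i >= 0; i--)
                  if (rdp->nxttail[i] == rdp->nxttail[RCU_DONE_TAIL])
                          rdp->nxttail[i] = &rdp->nxtlist;
  }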
Quick Quiz .51:
But how does the code in
Figure
handle nested NMIs?
Answer:
It does not have to handle nested NMIs, because NMIs do not nest.
Quick Quiz .52:
Why isn't there a memory barrier between lines 8 and 9 of
Figure ?
Couldn't this cause the code to fetch even-numbered values
from both the ->dynticks and ->dynticks_nmi fields,
even though these two fields never were zero at the same time?
Answer:
First, review the code in
Figures ,
, and
,
and note that dynticks and dynticks_nmi will never
have odd values simultaneously (see especially lines 6 and 17 of
Figure
,
and recall that interrupts cannot happen from NMIs).
Of course, given the placement of the memory barriers in these functions, it might appear to another CPU that both counters were odd at the same time, but logically this cannot be the case; such an appearance would instead indicate that the CPU had in fact passed through dynticks-idle mode.
Now, let's suppose that at the time line 8 fetches ->dynticks, the value of ->dynticks_nmi was an odd number, and that at the time line 9 fetches ->dynticks_nmi, the value of ->dynticks was an odd number. Given that both counters cannot be odd simultaneously, there must have been a time between these two fetches when both counters were even, and thus a time when the CPU was in dynticks-idle mode, which is a quiescent state, as required.
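A self-contained sketch of the resulting test (with illustrative type
and function names):

  struct dynticks_snap {
          long dynticks;      /* odd while CPU is in non-idle execution */
          long dynticks_nmi;  /* odd while CPU is in an NMI handler */
  };

  /* The CPU was in dyntick-idle mode, and hence in a quiescent
   * state, only if both sampled counters are even. */
  static int in_dynticks_idle(const struct dynticks_snap *snap)
  {
          return !(snap->dynticks & 0x1) && !(snap->dynticks_nmi & 0x1);
  }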
So, why can't the && on line 13 of
Figure
be replaced with an ==?
Well, it could be, but this would likely be more confusing
than helpful.
Quick Quiz .53:
Why wait the extra couple jiffies on lines 12-13 in
Figure ?
Answer:
This added delay gives the offending CPU a better chance of
reporting on itself, thus getting a decent stack trace of
the stalled code.
Of course, if the offending CPU is spinning with interrupts
disabled, it will never report on itself, so other CPUs
do so after a short delay.
Quick Quiz .54:
What prevents the grace period from ending before the
stall warning is printed in
Figure ?
Answer:
The caller checked that this CPU still had not reported a
quiescent state, and because preemption is disabled, there is
no way that a quiescent state could have been reported in
the meantime.
Quick Quiz .55:
Why does print_other_cpu_stall() in
Figure
need to check for the grace period ending when
print_cpu_stall() did not?
Answer:
The other CPUs might pass through a quiescent state at any time,
so the grace period might well have ended in the meantime.
Quick Quiz .56:
Why is it important that blocking primitives
called from within a preemptible-RCU read-side critical section be
subject to priority inheritance?
Answer:
Because blocked readers stall RCU grace periods,
which can result in OOM.
For example, if a reader did a wait_event() within
an RCU read-side critical section, and that event never occurred,
then RCU grace periods would stall indefinitely, guaranteeing that
the system would OOM sooner or later.
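For concreteness, the problematic pattern looks like this (the wait
queue, condition, and helper names are illustrative):

  rcu_read_lock();
  p = rcu_dereference(gp);
  wait_event(my_wq, my_condition);  /* might never return... */
  do_something_with(p);
  rcu_read_unlock();                /* ...so readers never complete */

Until the rcu_read_unlock() is reached, every subsequent grace period
must wait, and callbacks accumulate without bound.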
There must therefore be some way to cause these readers to progress
through their read-side critical sections in order to avoid such OOMs.
Priority boosting is one way to force such progress, but only if
readers are restricted to blocking such that they can be awakened via
priority boosting.
Of course, there are other methods besides priority inheritance that handle the priority inversion problem, including priority ceiling, preemption disabling, and so on. However, there are good reasons why priority inheritance is the approach used in the Linux kernel, so this is what is used for RCU.
Quick Quiz .57:
Could the prohibition against using primitives
that would block in a non-CONFIG_PREEMPT kernel be lifted,
and if so, under what conditions?
Answer:
If testing and benchmarking demonstrated that the
preemptible RCU worked well enough that classic RCU could be dispensed
with entirely, and if priority inheritance was implemented for blocking
synchronization primitives
such as semaphores, then those primitives could be
used in RCU read-side critical sections.
Quick Quiz .58:
How is it possible for lines 38-43 of
__rcu_advance_callbacks() to be executed when
lines 7-37 have not?
Won't they both be executed just after a counter flip, and
never at any other time?
Answer:
Consider the following sequence of events:
Quick Quiz .59:
What problems could arise if the lines containing
ACCESS_ONCE() in rcu_read_unlock()
were reordered by the compiler?
Answer:
It is not clear that the ACCESS_ONCE() on the fetch of rcu_read_lock_nesting (line 7) is required.
Quick Quiz .60:
What problems could arise if the lines containing
ACCESS_ONCE() in rcu_read_unlock()
were reordered by the CPU?
Answer:
Absolutely none! The code in rcu_read_unlock()
interacts with the scheduling-clock interrupt handler
running on the same CPU, and is thus insensitive to reorderings
because CPUs always see their own accesses as if they occurred
in program order.
Other CPUs do access the rcu_flipctr, but because these
other CPUs don't access any of the other variables, ordering is
irrelevant.
Quick Quiz .61:
What problems could arise in
rcu_read_unlock() if irqs were not disabled?
Answer:
Of course, both of these situations could be handled by disabling preemption rather than disabling irqs. (The CPUs I have access to do not show much difference between these two alternatives, but others might.)
Quick Quiz .62:
Suppose that the irq disabling in
rcu_read_lock() was replaced by preemption disabling.
What effect would that have on GP_STAGES?
Answer:
No finite value of GP_STAGES suffices.
The following scenario, courtesy of Oleg Nesterov, demonstrates this:
Suppose that low-priority Task A has executed rcu_read_lock() on CPU 0, and thus has incremented per_cpu(rcu_flipctr, 0)[0], which thus has a value of one. Suppose further that Task A is now preempted indefinitely.
Given this situation, consider the following sequence of events:
This sequence of events could repeat indefinitely, so that no finite value of GP_STAGES could prevent disrupting Task A. This sequence of events demonstrates the importance of the promise made by CPUs that acknowledge an increment of rcu_ctrlblk.completed, as the problem illustrated by the above sequence of events is caused by Task B's repeated failure to honor this promise.
Therefore, more-pervasive changes to the grace-period state will be required in order for rcu_read_lock() to be able to safely dispense with irq disabling.
Quick Quiz .63:
Why can't the rcu_dereference()
precede the memory barrier?
Answer:
Because the memory barrier is being executed in
an interrupt handler, and interrupts are exact in the sense that
a single value of the PC is saved upon interrupt, so that the
interrupt occurs at a definite place in the code.
Therefore, if the
rcu_dereference() were to precede the memory barrier,
the interrupt would have had to have occurred after the
rcu_dereference(), and therefore
the interrupt would also have had to have occurred after the
rcu_read_lock() that begins the RCU read-side critical
section.
This would have forced the rcu_read_lock() to use
the earlier value of the grace-period counter, which would in turn
have meant that the corresponding rcu_read_unlock()
would have had to precede the first "Old counters zero [0]" rather
than the second one.
This in turn would have meant that the read-side critical section
would have been much shorter -- which would have been
counter-productive,
given that the point of this exercise was to identify the longest
possible RCU read-side critical section.
Quick Quiz .64:
What is a more precise way to say "CPU 0
might see CPU 1's increment as early as CPU 1's last previous
memory barrier"?
Answer:
First, it is important to note that the problem with
the less-precise statement is that it gives the impression that there
might be a single global timeline, which there is not, at least not for
popular microprocessors.
Second, it is important to note that memory barriers are all about
perceived ordering, not about time.
Finally, a more precise way of stating the above statement would be as
follows: "If CPU 0 loads the value resulting from CPU 1's
increment, then any subsequent load by CPU 0 will see the
values from any relevant stores by CPU 1 if these stores
preceded CPU 1's last prior memory barrier."
Even this more-precise version leaves some wiggle room. The word "subsequent" must be understood to mean "ordered after", either by an explicit memory barrier or by the CPU's underlying memory ordering. In addition, the memory barriers must be strong enough to order the relevant operations. For example, CPU 1's last prior memory barrier must order stores (for example, smp_wmb() or smp_mb()). Similarly, if CPU 0 needs an explicit memory barrier to ensure that its later load follows the one that saw the increment, then this memory barrier needs to be an smp_rmb() or smp_mb().
In general, much care is required when proving parallel algorithms.