D.4.2.4.1 rcu_read_lock()
Figure: __rcu_read_lock() Implementation

The implementation of rcu_read_lock() is as shown in this figure.
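The following minimal sketch reconstructs that listing from the
description below; the RCU_DATA_ME() per-CPU accessor, the local
declarations, and the low-order-bit mask are assumptions made for
illustration, and the line references in the text correspond to the
comments.

  void __rcu_read_lock(void)
  {
    int idx;
    struct task_struct *t = current;
    int nesting;

    nesting = ACCESS_ONCE(t->rcu_read_lock_nesting);      /* line 7 */
    if (nesting != 0) {                                    /* line 8 */
      /* Already in a critical section: just bump the count. */
      t->rcu_read_lock_nesting = nesting + 1;              /* line 9 */
    } else {
      unsigned long flags;

      /* Outermost rcu_read_lock(): do the real work with irqs off. */
      local_irq_save(flags);                               /* line 13 */
      idx = ACCESS_ONCE(rcu_ctrlblk.completed) & 0x1;      /* line 14 */
      smp_read_barrier_depends();                          /* line 15 */
      ACCESS_ONCE(RCU_DATA_ME()->rcu_flipctr[idx])++;      /* line 16 */
      ACCESS_ONCE(t->rcu_read_lock_nesting) = nesting + 1; /* line 17 */
      ACCESS_ONCE(t->rcu_flipctr_idx) = idx;               /* line 18 */
      local_irq_restore(flags);                            /* line 19 */
    }
  }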
Line 7 fetches this task's RCU read-side critical-section nesting
counter.
If line 8 finds that this counter is non-zero,
then we are already protected by an outer
rcu_read_lock(), in which case line 9 simply increments
this counter.
However, if this is the outermost rcu_read_lock(),
then more work is required.
Lines 13 and 19 suppress and restore irqs to ensure that the
intervening code is neither preempted nor interrupted by a
scheduling-clock interrupt (which runs the grace-period state machine).
Line 14 fetches the grace-period counter,
line 16 increments the current counter for
this CPU, line 17 increments the nesting counter,
and line 18 records the old/new counter index so that
rcu_read_unlock() can decrement the corresponding
counter (but on whatever CPU it ends up running on).
The ACCESS_ONCE() macros force the compiler to
emit the accesses in order.
Although this does not prevent the CPU from reordering the accesses
from the viewpoint of other CPUs, it does ensure that NMI and
SMI handlers running on this CPU will see these accesses in order.
This is critically important:
- In the absence of the ACCESS_ONCE() in the assignment
to idx, the compiler would be within its rights
to: (a) eliminate the local variable idx and
(b) compile the increment on line 16 as a
fetch-increment-store sequence, doing separate accesses to
rcu_ctrlblk.completed for the fetch and the
store.
If the value of rcu_ctrlblk.completed had
changed in the meantime, this would corrupt the
rcu_flipctr values (a sketch of this transformation
follows this list).
- If the assignment to rcu_read_lock_nesting
(line 17) were to be reordered to precede the increment
of rcu_flipctr (line 16), and if an
NMI occurred between these two events, then an
rcu_read_lock() in that NMI's handler
would incorrectly conclude that it was already under the
protection of rcu_read_lock().
- If the assignment to rcu_read_lock_nesting
(line 17) were to be reordered to follow the assignment
to rcu_flipctr_idx (line 18), and if an
NMI occurred between these two events, then an
rcu_read_lock() in that NMI's handler
would clobber rcu_flipctr_idx, possibly
causing the matching rcu_read_unlock() to
decrement the wrong counter.
This in turn could result in premature ending of a
grace period, indefinite extension of a grace period,
or even both.
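To illustrate the first item, the transformation that the compiler
would be permitted to make might look roughly like the following
hypothetical rewrite (not actual compiler output):

  /* With the ACCESS_ONCE() omitted, idx has been eliminated, so
   * rcu_ctrlblk.completed is read once for the fetch and again for
   * the store. */
  tmp = RCU_DATA_ME()->rcu_flipctr[rcu_ctrlblk.completed & 0x1];
  /* ... rcu_ctrlblk.completed might change here ... */
  RCU_DATA_ME()->rcu_flipctr[rcu_ctrlblk.completed & 0x1] = tmp + 1;

If the counter flips between the two accesses, the value fetched from
one rcu_flipctr element is incremented and stored into the other,
corrupting both elements.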
It is not clear that the ACCESS_ONCE() on the assignment to
nesting (line 7) is required.
It is also unclear whether the smp_read_barrier_depends()
(line 15) is needed; it was added to ensure that changes to the
index and the value remain ordered.
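For background, these two primitives were conventionally defined
along the following lines in kernels of that era (paraphrased here,
not taken from the figure):

  /* ACCESS_ONCE() is a volatile cast: the compiler must emit exactly
   * one access and must not reorder it with respect to other
   * volatile accesses, though the CPU remains free to reorder. */
  #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

  /* smp_read_barrier_depends() orders a load against later loads
   * that depend on its value; it is a real barrier only on DEC Alpha
   * and a no-op on other architectures. */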
The reasons that irqs must be disabled from line 13 through
line 19 are as follows:
- Suppose one CPU loaded rcu_ctrlblk.completed
(line 14), then a second CPU incremented this counter,
and then the first CPU took a scheduling-clock interrupt.
The first CPU would then see that it needed to acknowledge
the counter flip, which it would do.
This acknowledgment is a promise to avoid incrementing
the newly old counter, and this CPU would break this
promise.
Worse yet, this CPU might be preempted immediately upon
return from the scheduling-clock interrupt, and thus
end up incrementing the counter at some random point
in the future.
Either situation could disrupt grace-period detection.
- Disabling irqs has the side effect of disabling preemption.
If this code were to be preempted between fetching
rcu_ctrlblk.completed (line 14) and
incrementing rcu_flipctr (line 16),
it might well be migrated to some other CPU.
This would result in it non-atomically incrementing
the counter from that other CPU.
If this CPU happened to be executing in rcu_read_lock()
or rcu_read_unlock() just at that time, one
of the increments or decrements might be lost, again
disrupting grace-period detection.
The same result could happen on RISC machines if the preemption
occurred in the middle of the increment (after the fetch of
the old counter but before the store of the newly incremented
counter); a sketch of this lost-update scenario follows this list.
- Permitting preemption in the midst
of line 16, between selecting the current CPU's copy
of the rcu_flipctr array and the increment of
the element indicated by rcu_flipctr_idx, can
result in a similar failure.
Execution might well resume on some other CPU.
If this resumption happened concurrently with an
rcu_read_lock() or rcu_read_unlock()
running on the original CPU,
an increment or decrement might be lost, resulting in either
premature termination of a grace period, indefinite extension
of a grace period, or even both.
- Failing to disable preemption can also defeat RCU priority
boosting, which relies on rcu_read_lock_nesting
to determine when a given task is in an RCU read-side
critical section.
So, for example, if a given task is indefinitely
preempted just after incrementing rcu_flipctr,
but before updating rcu_read_lock_nesting,
then it will stall RCU grace periods for as long as it
is preempted.
However, because rcu_read_lock_nesting has not
yet been incremented, the RCU priority booster has no way
to tell that boosting is needed.
Therefore, in the presence of CPU-bound realtime threads,
the preempted task might stall grace periods indefinitely,
eventually causing an OOM event.
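The lost-update scenario called out in the second and third items
can be sketched as follows; the explicit load/store split is meant to
illustrate what a RISC machine (or a migrated increment) effectively
does, not actual generated code:

  /* Non-atomic increment of a per-CPU rcu_flipctr element, shown as
   * its constituent load and store. */
  tmp = RCU_DATA_ME()->rcu_flipctr[idx];      /* fetch old value */
  /* <-- preemption and migration here, or a racing rcu_read_lock()
   *     or rcu_read_unlock() on this CPU, updates the same element */
  RCU_DATA_ME()->rcu_flipctr[idx] = tmp + 1;  /* store clobbers it */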
The last three reasons could of course be addressed by disabling
preemption rather than disabling irqs, but given that the first
reason requires disabling irqs in any case, there is little reason
to separately disable preemption.
It is entirely possible that the first reason might be tolerated
by requiring an additional grace-period stage; however, it is not
clear that disabling preemption is much faster than disabling
interrupts on modern CPUs.
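For comparison, a preemption-only variant of this critical section
would look roughly like the sketch below; as discussed above, it
would still be exposed to the scheduling-clock interrupt, so it is
shown only to make the trade-off concrete:

  /* Hypothetical preemption-only protection (insufficient on its
   * own: a scheduling-clock interrupt could still run the
   * grace-period state machine between the fetch of
   * rcu_ctrlblk.completed and the rcu_flipctr increment). */
  preempt_disable();
  idx = ACCESS_ONCE(rcu_ctrlblk.completed) & 0x1;
  smp_read_barrier_depends();
  ACCESS_ONCE(RCU_DATA_ME()->rcu_flipctr[idx])++;
  ACCESS_ONCE(t->rcu_read_lock_nesting) = nesting + 1;
  ACCESS_ONCE(t->rcu_flipctr_idx) = idx;
  preempt_enable();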