D.4.2.4.1 rcu_read_lock()

Figure: __rcu_read_lock() Implementation
\begin{figure}{ \scriptsize
\begin{verbatim}1 void __rcu_read_lock(void)
2 {...
...) = idx;
18 local_irq_restore(flags);
19 }
20 }\end{verbatim}
}\end{figure}

The implementation of rcu_read_lock() is as shown in Figure [*]. Line 7 fetches this task's RCU read-side critical-section nesting counter. If line 8 finds that this counter is non-zero, then we are already protected by an outer rcu_read_lock(), in which case line 9 simply increments this counter.

However, if this is the outermost rcu_read_lock(), then more work is required. Lines 13 and 18 suppress and restore irqs to ensure that the intervening code is neither preempted nor interrupted by a scheduling-clock interrupt (which runs the grace period state machine). Line 14 fetches the grace-period counter, line 15 increments the current counter for this CPU, line 16 increments the nesting counter, and line 17 records the old/new counter index so that rcu_read_unlock() can decrement the corresponding counter (but on whatever CPU it ends up running on).

The ACCESS_ONCE() macros force the compiler to emit the accesses in order. Although this does not prevent the CPU from reordering the accesses from the viewpoint of other CPUs, it does ensure that NMI and SMI handlers running on this CPU will see these accesses in order. This is critically important:

  1. In absence of the ACCESS_ONCE() in the assignment to idx, the compiler would be within its rights to: (a) eliminate the local variable idx and (b) compile the increment on line 16 as a fetch-increment-store sequence, doing separate accesses to rcu_ctrlblk.completed for the fetch and the store. If the value of rcu_ctrlblk.completed had changed in the meantime, this would corrupt the rcu_flipctr values.
  2. If the assignment to rcu_read_lock_nesting (line 17) were to be reordered to precede the increment of rcu_flipctr (line 16), and if an NMI occurred between these two events, then an rcu_read_lock() in that NMI's handler would incorrectly conclude that it was already under the protection of rcu_read_lock().
  3. If the assignment to rcu_read_lock_nesting (line 17) were to be reordered to follow the assignment to rcu_flipctr_idx (line 18), and if an NMI occurred between these two events, then an rcu_read_lock() in that NMI's handler would clobber rcu_flipctr_idx, possibly causing the matching rcu_read_unlock() to decrement the wrong counter. This in turn could result in premature ending of a grace period, indefinite extension of a grace period, or even both.

It is not clear that the ACCESS_ONCE on the assignment to nesting (line 7) is required. It is also unclear whether the smp_read_barrier_depends() (line 15) is needed: it was added to ensure that changes to index and value remain ordered.

The reasons that irqs must be disabled from line 13 through line 19 are as follows:

  1. Suppose one CPU loaded rcu_ctrlblk.completed (line 14), then a second CPU incremented this counter, and then the first CPU took a scheduling-clock interrupt. The first CPU would then see that it needed to acknowledge the counter flip, which it would do. This acknowledgment is a promise to avoid incrementing the newly old counter, and this CPU would break this promise. Worse yet, this CPU might be preempted immediately upon return from the scheduling-clock interrupt, and thus end up incrementing the counter at some random point in the future. Either situation could disrupt grace-period detection.
  2. Disabling irqs has the side effect of disabling preemption. If this code were to be preempted between fetching rcu_ctrlblk.completed (line 14) and incrementing rcu_flipctr (line 16), it might well be migrated to some other CPU. This would result in it non-atomically incrementing the counter from that other CPU. If this CPU happened to be executing in rcu_read_lock() or rcu_read_unlock() just at that time, one of the increments or decrements might be lost, again disrupting grace-period detection. The same result could happen on RISC machines if the preemption occurred in the middle of the increment (after the fetch of the old counter but before the store of the newly incremented counter).
  3. Permitting preemption in the midst of line 16, between selecting the current CPU's copy of the rcu_flipctr array and the increment of the element indicated by rcu_flipctr_idx, can result in a similar failure. Execution might well resume on some other CPU. If this resumption happened concurrently with an rcu_read_lock() or rcu_read_unlock() running on the original CPU, an increment or decrement might be lost, resulting in either premature termination of a grace period, indefinite extension of a grace period, or even both.
  4. Failing to disable preemption can also defeat RCU priority boosting, which relies on rcu_read_lock_nesting to determine when a given task is in an RCU read-side critical section. So, for example, if a given task is indefinitely preempted just after incrementing rcu_flipctr, but before updating rcu_read_lock_nesting, then it will stall RCU grace periods for as long as it is preempted. However, because rcu_read_lock_nesting has not yet been incremented, the RCU priority booster has no way to tell that boosting is needed. Therefore, in the presence of CPU-bound realtime threads, the preempted task might stall grace periods indefinitely, eventually causing an OOM event.

The last three reasons could of course be addressed by disabling preemption rather than disabling of irqs, but given that the first reason requires disabling irqs in any case, there is little reason to separately disable preemption. It is entirely possible that the first reason might be tolerated by requiring an additional grace-period stage, however, it is not clear that disabling preemption is much faster than disabling interrupts on modern CPUs.

Paul E. McKenney 2011-12-16