Because this implementation of preemptible RCU does not require memory
barriers in rcu_read_lock() and rcu_read_unlock(),
a multi-stage grace-period detection algorithm is required.
Instead of using a single wait queue of callbacks
(which has sufficed for earlier RCU implementations), this implementation
uses an array of wait queues, so that RCU callbacks
are enqueued on each element of this array in turn.
This difference in callback flow is shown in
Figure
for a preemptible RCU implementation with two waitlist stages per grace period
(in contrast,
the September 10 2007 patch to -rt [McK07c]
uses four waitlist stages).
Given two stages per grace period, any pair of stages forms a full grace period. Similarly, in an implementation with four stages per grace period, any sequence of four stages would form a full grace period.
To determine when a grace-period stage can end,
preemptible RCU uses a per-CPU two-element rcu_flipctr array
that tracks in-progress RCU read-side critical sections.
One element of a given CPU's rcu_flipctr array tracks
old RCU read-side critical sections, in other words, critical sections
that started before the current grace-period stage.
The other element tracks new RCU read-side critical
sections, namely those starting during the current grace-period stage.
The array elements switch roles at the beginning of each new grace-period
stage, as shown in
Figure .
During the first stage on the left-hand side of the above figure, rcu_flipctr[0] tracks the new RCU read-side critical sections, and is therefore incremented by rcu_read_lock() and decremented by rcu_read_unlock(). Similarly, rcu_flipctr[1] tracks the old RCU read-side critical sections (those that started during earlier stages), and is therefore decremented by rcu_read_unlock() and never incremented at all.
Because each CPU's old rcu_flipctr[1] elements are never incremented, their sum across all CPUs must eventually go to zero, although preemption in the midst of an RCU read-side critical section might cause any individual counter to remain non-zero or even to go negative. For example, suppose that a task calls rcu_read_lock() on one CPU, is preempted, resumes on another CPU, and then calls rcu_read_unlock(). The first CPU's counter will then be +1 and the second CPU's counter will be -1, however, they will still sum to zero. Regardless of possible preemption, when the sum of the old counter elements does go to zero, it is safe to move to the next grace-period stage, as shown on the right-hand side of the above figure.
In this second stage, the elements of each CPU's rcu_flipctr counter array switch roles. The rcu_flipctr[0] counter now tracks the old RCU read-side critical sections, in other words, the ones that started during grace period stage 0. Similarly, the rcu_flipctr[1] counter now tracks the new RCU read-side critical sections that start in grace period stage 1. Therefore, rcu_read_lock() now increments rcu_flipctr[1], while rcu_read_unlock() still might decrement either counter. Specifically, if the matching rcu_read_lock() executed during grace-period stage 0 (the old stage at this point), then rcu_read_unlock() must decrement rcu_flipctr[0], but if the matching rcu_read_lock() executed during grace-period stage 1 (the new stage), then rcu_read_unlock() must instead decrement rcu_flipctr[1].
The critical point is that all rcu_flipctr elements tracking the old RCU read-side critical sections must strictly decrease. Therefore, once the sum of these old counters reaches zero, it cannot change.
The rcu_read_lock() primitive uses the bottom bit of the current grace-period counter (rcu_ctrlblk.completed & 0x1) to index the rcu_flipctr array, and records this index in the task structure. The matching rcu_read_unlock() uses this recorded value to ensure that it decrements a counter corresponding to the one that the matching rcu_read_lock() incremented. Of course, if the RCU read-side critical section has been preempted, rcu_read_lock() might be decrementing the counter belonging to a different CPU than the one whose counter was incremented by the matching rcu_read_lock().
Each CPU also maintains rcu_flip_flag and rcu_mb_flag per-CPU variables. The rcu_flip_flag variable is used to synchronize the start of each grace-period stage: once a given CPU has responded to its rcu_flip_flag, it must refrain from incrementing the rcu_flip array element that now corresponds to the old grace-period stage. The CPU that advances the counter (rcu_ctrlblk.completed) changes the value of each CPU's rcu_mb_flag to rcu_flipped, but a given rcu_mb_flag may be changed back to rcu_flip_seen only by the corresponding CPU.
The rcu_mb_flag variable is used to force each CPU to execute a memory barrier at the end of each grace-period stage. These memory barriers are required to ensure that memory accesses from RCU read-side critical sections ending in a given grace-period stage are ordered before the end of that stage. This approach gains the benefits memory barriers at the beginning and end of each RCU read-side critical section without having to actually execute all those costly barriers. The rcu_mb_flag is set to rcu_mb_needed by the CPU that detects that the sum of the old counters is zero, but a given rcu_mb_flag is changed back to rcu_mb_done only by the corresponding CPU, and even then only after executing a memory barrier.
Paul E. McKenney 2011-12-16