Figure (rcu_rcpg.h) shows the read-side primitives of an RCU implementation that uses a pair of reference counters (rcu_refcnt[]), along with a global index that selects one counter out of the pair (rcu_idx), a per-thread nesting counter (rcu_nesting), a per-thread snapshot of the global index (rcu_read_idx), and a global lock (rcu_gp_lock), all of which are themselves shown in Figure .
The rcu_read_lock() primitive atomically increments the member of the rcu_refcnt[] pair indexed by rcu_idx, and keeps a snapshot of this index in the per-thread variable rcu_read_idx. The rcu_read_unlock() primitive then atomically decrements whichever counter of the pair that the corresponding rcu_read_lock() incremented. However, because only one value of rcu_idx is remembered per thread, additional measures must be taken to permit nesting. These additional measures use the per-thread rcu_nesting variable to track nesting.
To make all this work, line 6 of rcu_read_lock() in Figure picks up the current thread's instance of rcu_nesting, and if line 7 finds that this is the outermost rcu_read_lock(), then lines 8-10 pick up the current value of rcu_idx, save it in this thread's instance of rcu_read_idx, and atomically increment the selected element of rcu_refcnt. Regardless of the value of rcu_nesting, line 12 increments it. Line 13 executes a memory barrier to ensure that the RCU read-side critical section does not bleed out before the rcu_read_lock() code.
Similarly, the rcu_read_unlock() function executes a memory barrier at line 21 to ensure that the RCU read-side critical section does not bleed out after the rcu_read_unlock() code. Line 22 picks up this thread's instance of rcu_nesting, and if line 23 finds that this is the outermost rcu_read_unlock(), then lines 24 and 25 pick up this thread's instance of rcu_read_idx (saved by the outermost rcu_read_lock()) and atomically decrement the selected element of rcu_refcnt. Regardless of the nesting level, line 27 decrements this thread's instance of rcu_nesting.
Figure (rcu_rcpg.c) shows the corresponding synchronize_rcu() implementation.
Lines 6 and 19 acquire and release rcu_gp_lock in order to
prevent more than one concurrent instance of synchronize_rcu().
Lines 7-8 pick up the value of rcu_idx and complement it, respectively, so that subsequent instances of rcu_read_lock() will use a different element of rcu_refcnt than did preceding instances. Lines 10-12 then wait for the prior element of rcu_refcnt to reach zero, with the memory barrier on line 9 ensuring that the check of rcu_refcnt is not reordered to precede the complementing of rcu_idx.
Lines 13-18 repeat this process, and line 20 ensures that any
subsequent reclamation operations are not reordered to precede the
checking of rcu_refcnt.
Quick Quiz 10.42:
Why the memory barrier on line 5 of synchronize_rcu() in Figure , given that there is a spin-lock acquisition immediately after?
End Quick Quiz
Quick Quiz 10.43:
Why is the counter flipped twice in
Figure ?
Shouldn't a single flip-and-wait cycle be sufficient?
End Quick Quiz
This implementation avoids the update-starvation issues that could
occur in the single-counter implementation shown in
Figure .
However, it still has some serious shortcomings.
First, the atomic operations in rcu_read_lock()
and rcu_read_unlock()
are still quite heavyweight.
In fact, they are more complex than those
of the single-counter variant shown in
Figure ,
with the read-side primitives consuming about 150 nanoseconds on a single
Power5 CPU and almost 40 microseconds on a 64-CPU system.
The update-side synchronize_rcu() primitive is more costly as
well, ranging from about 200 nanoseconds on a single Power5 CPU to
more than 40 microseconds on a 64-CPU system.
This means that the RCU read-side critical sections
have to be extremely long in order to get any real
read-side parallelism.
Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock() operations, there will be extreme memory contention on the rcu_refcnt elements, resulting in expensive cache misses. This further extends the RCU read-side critical-section duration required to provide parallel read-side access. These first two shortcomings defeat the purpose of RCU in most situations.
Third, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.
Finally, despite the fact that concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.
Quick Quiz 10.44:
Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line 10 and a
non-atomic decrement on line 25 of
Figure ?
End Quick Quiz
Despite these shortcomings, one could imagine this variant of RCU being used on small tightly coupled multiprocessors, perhaps as a memory-conserving implementation that maintains API compatibility with more complex implementations. However, it is unlikely to scale well beyond a few CPUs.
The next section describes yet another variation on the reference-counting scheme that provides greatly improved read-side performance and scalability.
Paul E. McKenney 2011-12-16