Figure (rcu_rcpl.h) shows the read-side primitives of an RCU implementation that uses per-thread pairs of reference counters.
This implementation is quite similar to that shown in Figure , the only difference being that rcu_refcnt is now a per-thread variable (as shown in Figure ), so that the rcu_read_lock() and rcu_read_unlock() primitives no longer perform atomic operations.
Quick Quiz 10.45:
Come off it! We can see the atomic_read() primitive in rcu_read_lock()!!! So why are you trying to pretend that rcu_read_lock() contains no atomic operations???
End Quick Quiz
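The read-side primitives described above might be sketched as follows in C11. This is a hypothetical rendering, not the book's code: the names my_tid and NR_THREADS, and the use of _Thread_local in place of the book's per-thread-variable API, are assumptions. Note that the only shared access in rcu_read_lock() is the read of rcu_idx (the atomic_read() mentioned in the Quick Quiz); the counter increment touches only the calling thread's own element, so there is no cross-thread read-modify-write.

```c
#include <stdatomic.h>

#define NR_THREADS 8  /* fixed upper bound on threads, as the text warns */

/* Hypothetical globals standing in for the book's per-thread variables. */
_Atomic int rcu_refcnt[NR_THREADS][2]; /* per-thread pair of reference counters */
atomic_int rcu_idx;                    /* which counter of each pair is current */
_Thread_local int my_tid;              /* this thread's slot; assumed assigned at creation */
_Thread_local int rcu_read_idx;        /* which counter this reader incremented */

void rcu_read_lock(void)
{
    int i = atomic_load(&rcu_idx);     /* the lone shared read (cf. atomic_read()) */
    rcu_read_idx = i;                  /* remember the index for rcu_read_unlock() */
    /* Plain load-then-store on this thread's own counter: no atomic RMW. */
    int c = atomic_load_explicit(&rcu_refcnt[my_tid][i], memory_order_relaxed);
    atomic_store_explicit(&rcu_refcnt[my_tid][i], c + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* order against the critical section */
}

void rcu_read_unlock(void)
{
    atomic_thread_fence(memory_order_seq_cst); /* order against the critical section */
    int i = rcu_read_idx;              /* decrement the same counter we incremented */
    int c = atomic_load_explicit(&rcu_refcnt[my_tid][i], memory_order_relaxed);
    atomic_store_explicit(&rcu_refcnt[my_tid][i], c - 1, memory_order_relaxed);
}
```

Unlike the book's full implementation, this sketch does not handle nested read-side critical sections that straddle a counter flip; it is meant only to show where the per-thread counter pairs and the single shared read of rcu_idx fit.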
Figure (rcu_rcpl.c) shows the implementation of synchronize_rcu(), along with a helper function named flip_counter_and_wait(). The synchronize_rcu() function resembles that shown in Figure , except that the repeated counter flip is replaced by a pair of calls, on lines 22 and 23, to the new helper function.
The new flip_counter_and_wait() function updates the rcu_idx variable on line 5 and executes a memory barrier on line 6, and then lines 7-11 spin on each thread's prior rcu_refcnt element, waiting for it to reach zero. Once all such elements have reached zero, it executes another memory barrier on line 12 and returns.
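The update-side logic just described might be sketched as follows in C11. Again this is an illustrative approximation rather than the book's code: the fixed-size rcu_refcnt array, the NR_THREADS bound, and the use of a pthread mutex for the book's update-side lock are all assumptions. The comment in synchronize_rcu() records why two flips are required: a reader may fetch the old value of rcu_idx just before the first flip, yet perform its increment afterward.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NR_THREADS 8  /* fixed upper bound, as discussed in the text */

/* Hypothetical globals standing in for the book's per-thread machinery. */
static _Atomic int rcu_refcnt[NR_THREADS][2]; /* one counter pair per thread */
static atomic_int rcu_idx;                    /* selects the current counter */
static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;

/* Flip rcu_idx, then wait for readers on the prior counters to drain. */
static void flip_counter_and_wait(void)
{
    int prior = atomic_load(&rcu_idx);
    atomic_store(&rcu_idx, !prior);            /* new readers use the other counter */
    atomic_thread_fence(memory_order_seq_cst); /* order the flip before the scan */
    for (int t = 0; t < NR_THREADS; t++)       /* spin on each thread's prior element */
        while (atomic_load(&rcu_refcnt[t][prior]) != 0)
            sched_yield();
    atomic_thread_fence(memory_order_seq_cst); /* order the scan before returning */
}

void synchronize_rcu(void)
{
    pthread_mutex_lock(&rcu_gp_lock);
    /* Two flips: a reader may have fetched the old rcu_idx just before the
     * first flip, then incremented what is now the "new" counter. */
    flip_counter_and_wait();
    flip_counter_and_wait();
    pthread_mutex_unlock(&rcu_gp_lock);
}
```

With no readers active, both calls to flip_counter_and_wait() find all prior counters already zero and return immediately, leaving rcu_idx with its original value after the two flips.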
This RCU implementation imposes important new requirements on its software environment, namely, (1) that it be possible to declare per-thread variables, (2) that these per-thread variables be accessible from other threads, and (3) that it is possible to enumerate all threads. These requirements can be met in almost all software environments, but often result in fixed upper bounds on the number of threads. More-complex implementations might avoid such bounds, for example, by using expandable hash tables. Such implementations might dynamically track threads, for example, by adding them on their first call to rcu_read_lock().
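The dynamic-tracking approach mentioned above might be sketched as follows, assuming a simple slot registry rather than an expandable hash table. The names rcu_thread_register, nr_registered, my_slot, and for_each_registered are hypothetical; the point is only that a thread claims a slot on its first call, after which the updater can enumerate every thread's per-thread state, subject to the fixed MAX_THREADS bound the text describes.

```c
#include <stdatomic.h>

#define MAX_THREADS 64   /* the fixed upper bound such designs often impose */

/* Hypothetical registry: each thread claims a slot on its first call,
 * which lets the updater enumerate all per-thread counters. */
static atomic_int nr_registered;
static _Thread_local int my_slot = -1;

void rcu_thread_register(void)
{
    /* Intended to be invoked from the first rcu_read_lock(); later calls
     * are no-ops.  A real implementation must reject slots >= MAX_THREADS. */
    if (my_slot < 0)
        my_slot = atomic_fetch_add(&nr_registered, 1);
}

/* Enumerate every registered thread, as synchronize_rcu() must. */
#define for_each_registered(t) \
    for (int t = 0; t < atomic_load(&nr_registered); t++)
```

An expandable hash table keyed by thread ID would remove the MAX_THREADS bound at the cost of more complex registration and enumeration, which is the trade-off the paragraph above alludes to.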
Quick Quiz 10.46:
Great, if we have N threads, we can have 2N ten-millisecond waits (one set per flip_counter_and_wait() invocation), and even that assumes that we wait only once for each thread.
Don't we need the grace period to complete much more quickly?
End Quick Quiz
This implementation still has several shortcomings. First, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.
Second, synchronize_rcu() must now examine a number of variables that increases linearly with the number of threads, imposing substantial overhead on applications with large numbers of threads.
Third, as before, although concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.
Finally, as noted in the text, the need for per-thread variables and for enumerating threads may be problematic in some software environments.
That said, the read-side primitives scale very nicely, requiring about 115 nanoseconds regardless of whether running on a single-CPU or a 64-CPU Power5 system. As noted above, the synchronize_rcu() primitive does not scale, ranging in overhead from almost a microsecond on a single Power5 CPU up to almost 200 microseconds on a 64-CPU system. This implementation could conceivably form the basis for a production-quality user-level RCU implementation.
The next section describes an algorithm permitting more efficient concurrent RCU updates.
Paul E. McKenney 2011-12-16