Figure
(rcu_rcpls.h)
shows the read-side primitives for an RCU implementation using per-thread
reference-count pairs, as before, but permitting updates to share
grace periods.
The main difference from the earlier implementation shown in
Figure
is that rcu_idx is now a long that counts freely,
so that line 8 of
Figure
must mask off the low-order bit.
We also switched from using atomic_read() and atomic_set()
to using ACCESS_ONCE().
The data is also quite similar, as shown in
Figure
,
with rcu_idx now being a long instead of an
atomic_t.
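Although the exact code appears in the figures, a minimal sketch of the
data and read-side primitives might look as follows. The per-thread-variable
helpers (DEFINE_PER_THREAD() and __get_thread_var()), along with spinlock_t,
smp_mb(), and ACCESS_ONCE(), are assumed to be supplied by the surrounding
environment.

    /* Sketch: data for the rcu_rcpls implementation. */
    static spinlock_t rcu_gp_lock;
    long rcu_idx;                    /* Free-running; low-order bit selects a counter. */
    DEFINE_PER_THREAD(long [2], rcu_refcnt);  /* Per-thread reference-count pair. */
    DEFINE_PER_THREAD(long, rcu_nesting);     /* Read-side nesting depth. */
    DEFINE_PER_THREAD(long, rcu_read_idx);    /* Counter chosen by outermost rcu_read_lock(). */

    static void rcu_read_lock(void)
    {
        long n = __get_thread_var(rcu_nesting);

        if (n == 0) {
            /* Outermost critical section: mask the free-running
             * counter down to its low-order bit to select one
             * counter of this thread's pair. */
            long i = ACCESS_ONCE(rcu_idx) & 0x1;

            __get_thread_var(rcu_read_idx) = i;
            __get_thread_var(rcu_refcnt)[i]++;
        }
        __get_thread_var(rcu_nesting) = n + 1;
        smp_mb();
    }

    static void rcu_read_unlock(void)
    {
        long n;

        smp_mb();
        n = __get_thread_var(rcu_nesting);
        if (n == 1) {
            /* Outermost rcu_read_unlock(): release the counter
             * acquired by the matching rcu_read_lock(). */
            long i = __get_thread_var(rcu_read_idx);

            __get_thread_var(rcu_refcnt)[i]--;
        }
        __get_thread_var(rcu_nesting) = n - 1;
    }

Because the counter is advanced by incrementing rather than by complementing,
concurrent updaters can each advance it, which is what permits the
grace-period sharing described below.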
Figure
(rcu_rcpls.c)
shows the implementation of synchronize_rcu() and its helper
function flip_counter_and_wait().
These are similar to those in
Figure
.
The differences in flip_counter_and_wait() center on the free-running
counter, which is now advanced by incrementing rather than by complementing.
The changes to synchronize_rcu() are more pervasive, as it must now detect
counter flips carried out by concurrent updaters; a sketch appears below.
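The following is a rough sketch of these update-side primitives, under the
same assumptions as above (spin_lock(), for_each_thread(), per_thread(),
poll(), and smp_mb() supplied by the surrounding environment); consult the
figure for the actual code.

    /* Sketch: advance the free-running counter, then wait for all
     * pre-existing readers using the now-old counter of each pair. */
    static void flip_counter_and_wait(long ctr)
    {
        long i;
        int t;

        ACCESS_ONCE(rcu_idx) = ctr + 1;
        i = ctr & 0x1;      /* Readers that saw rcu_idx == ctr used this bit. */
        smp_mb();
        for_each_thread(t) {
            while (per_thread(rcu_refcnt, t)[i] != 0)
                poll(NULL, 0, 10);   /* Sleep 10 ms between polls. */
        }
        smp_mb();
    }

    void synchronize_rcu(void)
    {
        long ctr;
        long oldctr;

        smp_mb();
        oldctr = ACCESS_ONCE(rcu_idx);   /* Snapshot before acquiring the lock. */
        smp_mb();
        spin_lock(&rcu_gp_lock);
        ctr = ACCESS_ONCE(rcu_idx);
        if (ctr - oldctr >= 3) {
            /* At least three flips occurred while we awaited the
             * lock, so a complete flip-wait-flip sequence began
             * after our snapshot: the grace period is covered. */
            spin_unlock(&rcu_gp_lock);
            smp_mb();
            return;
        }
        flip_counter_and_wait(ctr);
        if (ctr - oldctr < 2)
            /* Fewer than two concurrent flips: we must supply the
             * second flip ourselves. */
            flip_counter_and_wait(ctr + 1);
        spin_unlock(&rcu_gp_lock);
        smp_mb();
    }

Three flips suffice because the first of them might already have been under
way when the snapshot was taken; the remaining two then constitute a complete
grace period that began after this updater's memory barrier.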
With this approach, if an arbitrarily large number of threads invoke synchronize_rcu() concurrently, with one CPU for each thread, there will be a total of only three waits for counters to go to zero.
Despite the improvements, this implementation of RCU still has a few shortcomings. First, as before, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.
Second, each updater still acquires rcu_gp_lock, even if there
is no work to be done.
This can result in a severe scalability limitation
if there are large numbers of concurrent updates.
Section shows
one way to avoid this in a production-quality real-time
implementation of RCU for the Linux kernel.
Third, this implementation requires per-thread variables and the ability to enumerate threads, which again can be problematic in some software environments.
Finally, on 32-bit machines, a given update thread might be preempted long enough for the rcu_idx counter to overflow. This could cause such a thread to force an unnecessary pair of counter flips. However, even if each grace period took only one microsecond, the offending thread would need to be preempted for more than an hour, in which case an extra pair of counter flips is likely the least of your worries.
As with the implementation described in
Section ,
the read-side primitives scale extremely well, incurring roughly
115 nanoseconds of overhead regardless of the number of CPUs.
The synchronize_rcu() primitive is still expensive,
ranging from about one microsecond up to about 16 microseconds.
This is nevertheless much cheaper than the roughly 200 microseconds
incurred by the implementation in
Section
.
So, despite its shortcomings, one could imagine this
RCU implementation being used in production in real-life applications.
Quick Quiz 10.47:
All of these toy RCU implementations have either atomic operations
in rcu_read_lock() and rcu_read_unlock(),
or synchronize_rcu()
overhead that increases linearly with the number of threads.
Under what circumstances could an RCU implementation enjoy
light-weight implementations for all three of these primitives,
all having deterministic (O(1)) overheads and latencies?
End Quick Quiz
Referring back to
Figure ,
we see that there is one global-variable access and no fewer than four
accesses to thread-local variables.
Given the relatively high cost of thread-local accesses on systems
implementing POSIX threads, it is tempting to collapse the three
thread-local variables into a single structure, permitting
rcu_read_lock() and rcu_read_unlock() to access their
thread-local data with a single thread-local-storage access.
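A hypothetical sketch of this collapsing follows; the structure and field
names are illustrative only, not taken from the figures.

    /* Hypothetical sketch: collapse the three per-thread variables
     * into one structure so that a single thread-local-storage
     * access suffices. */
    struct rcu_reader_state {
        long nesting;       /* Read-side nesting depth. */
        long read_idx;      /* Which counter of the pair is in use. */
        long refcnt[2];     /* The per-thread reference-count pair. */
    };

    static __thread struct rcu_reader_state rcu_reader;

    static void rcu_read_lock(void)
    {
        struct rcu_reader_state *rrp = &rcu_reader;  /* Single TLS access. */

        if (rrp->nesting++ == 0) {
            rrp->read_idx = ACCESS_ONCE(rcu_idx) & 0x1;
            rrp->refcnt[rrp->read_idx]++;
        }
        smp_mb();
    }

Note that updaters would still need some way to enumerate each thread's
structure, so a plain __thread variable would have to be supplemented by a
registry, as noted in the third shortcoming above.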
However, an even better approach would be to reduce the number of
thread-local accesses to one, as is done in the next section.