A slightly more sophisticated RCU implementation is shown in
the accompanying figure (rcu_rcg.h and rcu_rcg.c).
This implementation makes use of a global reference counter
rcu_refcnt defined on line 1.
The rcu_read_lock() primitive atomically increments this
counter, then executes a memory barrier to ensure that the
RCU read-side critical section is ordered after the atomic
increment.
Similarly, rcu_read_unlock() executes a memory barrier to
confine the RCU read-side critical section, then atomically
decrements the counter.
The synchronize_rcu() primitive executes a memory barrier, spins
until the reference counter reaches zero, and then executes another
memory barrier.
The poll() on line 19 merely provides a delay; from
a pure RCU-semantics point of view it could be omitted.
Again, once synchronize_rcu() returns, all prior
RCU read-side critical sections are guaranteed to have completed.
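For concreteness, here is a minimal sketch of the scheme just
described, written with C11 atomics rather than the primitives used
in the actual rcu_rcg.h and rcu_rcg.c listing; the fence placement
follows the description above, but the details are illustrative
rather than a reproduction of the figure:

    #include <poll.h>
    #include <stdatomic.h>

    static atomic_int rcu_refcnt;   /* global reference counter */

    static void rcu_read_lock(void)
    {
            /* Atomically increment, then order the read-side
             * critical section after the increment. */
            atomic_fetch_add_explicit(&rcu_refcnt, 1,
                                      memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);
    }

    static void rcu_read_unlock(void)
    {
            /* Confine the read-side critical section before
             * the decrement. */
            atomic_thread_fence(memory_order_seq_cst);
            atomic_fetch_sub_explicit(&rcu_refcnt, 1,
                                      memory_order_relaxed);
    }

    static void synchronize_rcu(void)
    {
            atomic_thread_fence(memory_order_seq_cst);
            /* Spin until all pre-existing readers complete; the
             * poll() merely provides a delay between checks. */
            while (atomic_load_explicit(&rcu_refcnt,
                                        memory_order_relaxed) != 0)
                    poll(NULL, 0, 10);
            atomic_thread_fence(memory_order_seq_cst);
    }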
In happy contrast to the lock-based implementation shown earlier,
this implementation allows parallel execution of RCU read-side
critical sections.
In happy contrast to the per-thread lock-based implementation,
it also allows them to be nested.
In addition, the rcu_read_lock() primitive cannot possibly
participate in deadlock cycles, as it never spins nor blocks.
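Nesting works simply because each nested rcu_read_lock() adds one
more count. A hypothetical nested reader might look as follows, with
the counter values in the comments assuming no other concurrent
readers:

    static void nested_reader(void)
    {
            rcu_read_lock();    /* rcu_refcnt: 0 -> 1 */
            rcu_read_lock();    /* rcu_refcnt: 1 -> 2, nesting is safe */
            /* ... read-side accesses ... */
            rcu_read_unlock();  /* rcu_refcnt: 2 -> 1 */
            rcu_read_unlock();  /* rcu_refcnt: 1 -> 0, readers done */
    }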
Quick Quiz 10.39: But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section? End Quick Quiz
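To see the trap the quiz is pointing at, consider the following
hypothetical sketch, in which my_lock, updater(), and reader() are
illustrative names rather than part of the book's listing:

    #include <pthread.h>

    pthread_mutex_t my_lock = PTHREAD_MUTEX_INITIALIZER;

    void updater(void)
    {
            pthread_mutex_lock(&my_lock);
            synchronize_rcu();   /* spins until rcu_refcnt == 0... */
            pthread_mutex_unlock(&my_lock);
    }

    void reader(void)
    {
            rcu_read_lock();                 /* rcu_refcnt now nonzero */
            pthread_mutex_lock(&my_lock);    /* ...but this blocks on the
                                              * updater, so the counter
                                              * never drops: deadlock. */
            pthread_mutex_unlock(&my_lock);
            rcu_read_unlock();
    }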
However, this implementation still has some serious shortcomings. First, the atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single Power5 CPU up to almost 40 microseconds on a 64-CPU system. This means that RCU read-side critical sections have to be extremely long in order to achieve any real read-side parallelism. On the other hand, in the absence of readers, grace periods elapse in about 40 nanoseconds, many orders of magnitude faster than production-quality implementations in the Linux kernel.
Quick Quiz 10.40: How can the grace period possibly elapse in 40 nanoseconds when synchronize_rcu() contains a 10-millisecond delay? End Quick Quiz
Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock() operations, there will be extreme memory contention on rcu_refcnt, resulting in expensive cache misses. Both of these first two shortcomings largely defeat a major purpose of RCU, namely to provide low-overhead read-side synchronization primitives.
Finally, a large number of RCU readers with long read-side critical sections could prevent synchronize_rcu() from ever completing, as the global counter might never reach zero. This could result in starvation of RCU updates, which is of course unacceptable in production settings.
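For instance, two or more readers of the following hypothetical form,
with do_read_side_work() standing in for an arbitrary long read-side
critical section, can overlap so that rcu_refcnt never drops to zero:

    /* Hypothetical reader thread: with two or more such threads,
     * one can enter its critical section before another exits, so
     * the global counter stays nonzero and synchronize_rcu() spins
     * forever. */
    void *reader_thread(void *arg)
    {
            for (;;) {
                    rcu_read_lock();
                    do_read_side_work();  /* long critical section */
                    rcu_read_unlock();
            }
    }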
Quick Quiz 10.41: Why not simply make rcu_read_lock() wait when a concurrent synchronize_rcu() has been waiting too long in this RCU implementation? Wouldn't that prevent synchronize_rcu() from starving? End Quick Quiz
Therefore, it is still hard to imagine this implementation being useful in a production setting, though it has a bit more potential than the lock-based mechanism, for example, as an RCU implementation suitable for a high-stress debugging environment. The next section describes a variation on the reference-counting scheme that is more favorable to writers.