10.3.4.3 Simple Counter-Based RCU

Figure: RCU Implementation Using Single Global Reference Counter
\begin{figure}{ \scriptsize
\begin{verbatim}1 atomic_t rcu_refcnt;
2
3 stat...
...{
19 poll(NULL, 0, 10);
20 }
21 smp_mb();
22 }\end{verbatim}
}\end{figure}

A slightly more sophisticated RCU implementation is shown in Figure [*] (rcu_rcg.h and rcu_rcg.c). This implementation makes use of a global reference counter rcu_refcnt defined on line 1. The rcu_read_lock() primitive atomically increments this counter, then executes a memory barrier to ensure that the RCU read-side critical section is ordered after the atomic increment. Similarly, rcu_read_unlock() executes a memory barrier to confine the RCU read-side critical section, then atomically decrements the counter. The synchronize_rcu() primitive spins waiting for the reference counter to reach zero, surrounded by memory barriers. The poll() on line 19 merely provides pure delay, and from a pure RCU-semantics point of view could be omitted. Again, once synchronize_rcu() returns, all prior RCU read-side critical sections are guaranteed to have completed.

In happy contrast to the lock-based implementation shown in Section [*], this implementation allows parallel execution of RCU read-side critical sections. In happy contrast to the per-thread lock-based implementation shown in Section [*], it also allows them to be nested. In addition, the rcu_read_lock() primitive cannot possibly participate in deadlock cycles, as it never spins nor blocks.

Quick Quiz 10.39: But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section? End Quick Quiz

However, this implementations still has some serious shortcomings. First, the atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight, with read-side overhead ranging from about 100 nanoseconds on a single Power5 CPU up to almost 40 microseconds on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real read-side parallelism. On the other hand, in the absence of readers, grace periods elapse in about 40 nanoseconds, many orders of magnitude faster than production-quality implementations in the Linux kernel.

Quick Quiz 10.40: How can the grace period possibly elapse in 40 nanoseconds when synchronize_rcu() contains a 10-millisecond delay? End Quick Quiz

Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock() operations, there will be extreme memory contention on rcu_refcnt, resulting in expensive cache misses. Both of these first two shortcomings largely defeat a major purpose of RCU, namely to provide low-overhead read-side synchronization primitives.

Finally, a large number of RCU readers with long read-side critical sections could prevent synchronize_rcu() from ever completing, as the global counter might never reach zero. This could result in starvation of RCU updates, which is of course unacceptable in production settings.

Quick Quiz 10.41: Why not simply make rcu_read_lock() wait when a concurrent synchronize_rcu() has been waiting too long in the RCU implementation in Figure [*]? Wouldn't that prevent synchronize_rcu() from starving? End Quick Quiz

Therefore, it is still hard to imagine this implementation being useful in a production setting, though it has a bit more potential than the lock-based mechanism, for example, as an RCU implementation suitable for a high-stress debugging environment. The next section describes a variation on the reference-counting scheme that is more favorable to writers.

Paul E. McKenney 2011-12-16