10.3.4.4 Starvation-Free Counter-Based RCU

Figure: RCU Global Reference-Count Pair Data
\begin{figure}{ \scriptsize
\begin{verbatim}1 DEFINE_SPINLOCK(rcu_gp_lock);
2 atomic_t rcu_refcnt[2];
3 atomic_t rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);\end{verbatim}
}\end{figure}

Figure: RCU Read-Side Using Global Reference-Count Pair
\begin{figure}{ \scriptsize
\begin{verbatim}1 static void rcu_read_lock(void)
2 {
3   int i;
4   int n;
5
6   n = __get_thread_var(rcu_nesting);
7   if (n == 0) {
8     i = atomic_read(&rcu_idx);
9     __get_thread_var(rcu_read_idx) = i;
10     atomic_inc(&rcu_refcnt[i]);
11   }
12   __get_thread_var(rcu_nesting) = n + 1;
13   smp_mb();
14 }
15
16 static void rcu_read_unlock(void)
17 {
18   int i;
19   int n;
20
21   smp_mb();
22   n = __get_thread_var(rcu_nesting);
23   if (n == 1) {
24     i = __get_thread_var(rcu_read_idx);
25     atomic_dec(&rcu_refcnt[i]);
26   }
27 __get_thread_var(rcu_nesting) = n - 1;
28 }\end{verbatim}
}\end{figure}

Figure [*] (rcu_rcgp.h) shows the read-side primitives of an RCU implementation that uses a pair of reference counters (rcu_refcnt[]), along with a global index that selects one counter out of the pair (rcu_idx), a per-thread nesting counter rcu_nesting, a per-thread snapshot of the global index (rcu_read_idx), and a global lock (rcu_gp_lock), which are themselves shown in Figure [*].

The rcu_read_lock() primitive atomically increments the member of the rcu_refcnt[] pair indexed by rcu_idx, and keeps a snapshot of this index in the per-thread variable rcu_read_idx. The rcu_read_unlock() primitive then atomically decrements whichever counter of the pair the corresponding rcu_read_lock() incremented. However, because only one value of rcu_idx is remembered per thread, additional measures must be taken to permit nesting. These additional measures use the per-thread rcu_nesting variable to track nesting.

To make all this work, line 6 of rcu_read_lock() in Figure [*] picks up the current thread's instance of rcu_nesting, and if line 7 finds that this is the outermost rcu_read_lock(), then lines 8-10 pick up the current value of rcu_idx, save it in this thread's instance of rcu_read_idx, and atomically increment the selected element of rcu_refcnt. Regardless of the value of rcu_nesting, line 12 increments it. Line 13 executes a memory barrier to ensure that the RCU read-side critical section does not bleed out before the rcu_read_lock() code.

Similarly, the rcu_read_unlock() function executes a memory barrier at line 21 to ensure that the RCU read-side critical section does not bleed out after the rcu_read_unlock() code. Line 22 picks up this thread's instance of rcu_nesting, and if line 23 finds that this is the outermost rcu_read_unlock(), then lines 24 and 25 pick up this thread's instance of rcu_read_idx (saved by the outermost rcu_read_lock()) and atomically decrement the selected element of rcu_refcnt. Regardless of the nesting level, line 27 decrements this thread's instance of rcu_nesting.
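The read-side logic described above can be sketched in portable userspace C. In this sketch, C11 atomics, sequentially consistent fences, and _Thread_local variables stand in for the kernel-style atomic_t, smp_mb(), and DEFINE_PER_THREAD() primitives; the variable and function names follow the figures, but the translation itself is illustrative rather than the book's actual code:

```c
#include <stdatomic.h>

/* Userspace stand-ins for the figure's data: atomic_t becomes
 * atomic_int, DEFINE_PER_THREAD() becomes _Thread_local, and
 * smp_mb() becomes a sequentially consistent fence. */
static atomic_int rcu_refcnt[2];        /* Global reference-count pair. */
static atomic_int rcu_idx;              /* Selects the current counter. */
static _Thread_local int rcu_nesting;   /* Per-thread nesting depth. */
static _Thread_local int rcu_read_idx;  /* Per-thread snapshot of rcu_idx. */

static void rcu_read_lock(void)
{
  int n = rcu_nesting;

  if (n == 0) {                        /* Outermost rcu_read_lock()... */
    int i = atomic_load(&rcu_idx);     /* ...snapshots the global index, */
    rcu_read_idx = i;                  /* remembers it for unlock, */
    atomic_fetch_add(&rcu_refcnt[i], 1); /* and counts this reader in. */
  }
  rcu_nesting = n + 1;
  atomic_thread_fence(memory_order_seq_cst); /* smp_mb() */
}

static void rcu_read_unlock(void)
{
  int n;

  atomic_thread_fence(memory_order_seq_cst); /* smp_mb() */
  n = rcu_nesting;
  if (n == 1)                          /* Outermost rcu_read_unlock()... */
    atomic_fetch_sub(&rcu_refcnt[rcu_read_idx], 1); /* ...counts out. */
  rcu_nesting = n - 1;
}
```

Note that a nested critical section touches only rcu_nesting: the counter selected by the outermost rcu_read_lock() is remembered in rcu_read_idx and is decremented exactly once, by the outermost rcu_read_unlock().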

Figure: RCU Update Using Global Reference-Count Pair
\begin{figure}{ \scriptsize
\begin{verbatim}1 void synchronize_rcu(void)
2 {
3   int i;
4
5   smp_mb();
6   spin_lock(&rcu_gp_lock);
7   i = atomic_read(&rcu_idx);
8   atomic_set(&rcu_idx, !i);
9   smp_mb();
10   while (atomic_read(&rcu_refcnt[i]) != 0) {
11     poll(NULL, 0, 10);
12   }
13   smp_mb();
14   atomic_set(&rcu_idx, i);
15   smp_mb();
16   while (atomic_read(&rcu_refcnt[!i]) != 0) {
17     poll(NULL, 0, 10);
18   }
19 spin_unlock(&rcu_gp_lock);
20 smp_mb();
21 }\end{verbatim}
}\end{figure}

Figure [*] (rcu_rcgp.c) shows the corresponding synchronize_rcu() implementation. Lines 6 and 19 acquire and release rcu_gp_lock in order to prevent more than one concurrent instance of synchronize_rcu(). Lines 7-8 pick up the value of rcu_idx and complement it, respectively, so that subsequent instances of rcu_read_lock() will use a different element of rcu_refcnt[] than did preceding instances. Lines 10-12 then wait for the prior element of rcu_refcnt[] to reach zero, with the memory barrier on line 9 ensuring that the check of rcu_refcnt[] is not reordered to precede the complementing of rcu_idx. Lines 13-18 repeat this process, and line 20 ensures that any subsequent reclamation operations are not reordered to precede the checking of rcu_refcnt.
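The flip-and-wait sequence just described can likewise be sketched in userspace C. Here a pthread mutex plays the role of rcu_gp_lock, C11 atomics and fences replace the kernel-style primitives, and sched_yield() serves as the waiting-loop delay; the data declarations are repeated so that the sketch stands alone, and the whole translation is illustrative rather than the book's actual code:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Userspace stand-ins: a pthread mutex for rcu_gp_lock, atomic_int
 * for atomic_t, and sequentially consistent fences for smp_mb(). */
static pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int rcu_refcnt[2];
static atomic_int rcu_idx;

void synchronize_rcu(void)
{
  int i;

  atomic_thread_fence(memory_order_seq_cst); /* smp_mb() */
  pthread_mutex_lock(&rcu_gp_lock);   /* One grace period at a time. */
  i = atomic_load(&rcu_idx);
  atomic_store(&rcu_idx, !i);         /* New readers take the other counter. */
  atomic_thread_fence(memory_order_seq_cst); /* Flip before the check. */
  while (atomic_load(&rcu_refcnt[i]) != 0)   /* Drain pre-existing readers. */
    sched_yield();
  atomic_thread_fence(memory_order_seq_cst);
  atomic_store(&rcu_idx, i);          /* Second flip... */
  atomic_thread_fence(memory_order_seq_cst);
  while (atomic_load(&rcu_refcnt[!i]) != 0)  /* ...and second wait. */
    sched_yield();
  pthread_mutex_unlock(&rcu_gp_lock);
  atomic_thread_fence(memory_order_seq_cst); /* Reclamation stays after. */
}
```

With no readers present, both waiting loops fall straight through and rcu_idx ends up back at its original value.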

Quick Quiz 10.42: Why the memory barrier on line 5 of synchronize_rcu() in Figure [*] given that there is a spin-lock acquisition immediately after? End Quick Quiz

Quick Quiz 10.43: Why is the counter flipped twice in Figure [*]? Shouldn't a single flip-and-wait cycle be sufficient? End Quick Quiz

This implementation avoids the update-starvation issues that could occur in the single-counter implementation shown in Figure [*].

There are still some serious shortcomings. First, the atomic operations in rcu_read_lock() and rcu_read_unlock() are still quite heavyweight. In fact, they are more complex than those of the single-counter variant shown in Figure [*], with the read-side primitives consuming about 150 nanoseconds on a single Power5 CPU and almost 40 microseconds on a 64-CPU system. The update-side synchronize_rcu() primitive is more costly as well, ranging from about 200 nanoseconds on a single Power5 CPU to more than 40 microseconds on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real read-side parallelism.

Second, if there are many concurrent rcu_read_lock() and rcu_read_unlock() operations, there will be extreme memory contention on the rcu_refcnt elements, resulting in expensive cache misses. This further extends the RCU read-side critical-section duration required to provide parallel read-side access. These first two shortcomings defeat the purpose of RCU in most situations.

Third, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Finally, despite the fact that concurrent RCU updates could in principle be satisfied by a common grace period, this implementation serializes grace periods, preventing grace-period sharing.

Quick Quiz 10.44: Given that atomic increment and decrement are so expensive, why not just use non-atomic increment on line 10 and a non-atomic decrement on line 25 of Figure [*]? End Quick Quiz

Despite these shortcomings, one could imagine this variant of RCU being used on small tightly coupled multiprocessors, perhaps as a memory-conserving implementation that maintains API compatibility with more complex implementations. However, it would not likely scale well beyond a few CPUs.

The next section describes yet another variation on the reference-counting scheme that provides greatly improved read-side performance and scalability.

Paul E. McKenney 2011-12-16