10.3.4.6 Scalable Counter-Based RCU With Shared Grace Periods

Figure: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update Data
\begin{figure}
{ \scriptsize
\begin{verbatim}
1 DEFINE_SPINLOCK(rcu_gp_lock);
2 DEFINE_PER_THREAD(int [2], rcu_refcnt);
3 long rcu_idx;
4 DEFINE_PER_THREAD(int, rcu_nesting);
5 DEFINE_PER_THREAD(int, rcu_read_idx);
\end{verbatim}
}
\end{figure}

Figure: RCU Read-Side Using Per-Thread Reference-Count Pair and Shared Update
\begin{figure}
{ \scriptsize
\begin{verbatim}
 1 static void rcu_read_lock(void)
...
27 __get_thread_var(rcu_nesting) = n - 1;
28 }
\end{verbatim}
}
\end{figure}

Figure [*] (rcu_rcpls.h) shows the read-side primitives for an RCU implementation using per-thread reference-count pairs, as before, but permitting updates to share grace periods. The main difference from the earlier implementation shown in Figure [*] is that rcu_idx is now a long that counts freely, so that line 8 of Figure [*] must mask rcu_idx down to its bottom bit. We also switched from using atomic_read() and atomic_set() to using ACCESS_ONCE(). The data is also quite similar, as shown in Figure [*], with rcu_idx now being a long rather than an atomic_t.
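
Because the listing above is truncated, the following sketch reconstructs the read-side primitives from the surrounding description. It uses the variable names declared in the data figure, and the comments note where the line references in the text apply; treat it as illustrative rather than as a verbatim copy of rcu_rcpls.h.

\begin{verbatim}
static void rcu_read_lock(void)
{
  int i;
  int n;

  n = __get_thread_var(rcu_nesting);
  if (n == 0) {
    /* Outermost nesting: sample the free-running rcu_idx,
       masking it down to its bottom bit (line 8). */
    i = ACCESS_ONCE(rcu_idx) & 0x1;
    __get_thread_var(rcu_read_idx) = i;
    __get_thread_var(rcu_refcnt)[i]++;
  }
  __get_thread_var(rcu_nesting) = n + 1;
  smp_mb();
}

static void rcu_read_unlock(void)
{
  int i;
  int n;

  smp_mb();
  n = __get_thread_var(rcu_nesting);
  if (n == 1) {
    /* Outermost rcu_read_unlock(): release the reference
       acquired by the matching rcu_read_lock(). */
    i = __get_thread_var(rcu_read_idx);
    __get_thread_var(rcu_refcnt)[i]--;
  }
  __get_thread_var(rcu_nesting) = n - 1;
}
\end{verbatim}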

Figure: RCU Shared Update Using Per-Thread Reference-Count Pair
\begin{figure}
{ \scriptsize
\begin{verbatim}
 1 static void flip_counter_and_wait(int ctr)
...
35 spin_unlock(&rcu_gp_lock);
36 smp_mb();
37 }
\end{verbatim}
}
\end{figure}

Figure [*] (rcu_rcpls.c) shows the implementation of synchronize_rcu() and its helper function flip_counter_and_wait(). These are similar to those in Figure [*]. The differences in flip_counter_and_wait() include:

  1. Line 6 uses ACCESS_ONCE() instead of atomic_set(), and increments rcu_idx rather than complementing it.
  2. A new line 7 masks the counter down to its bottom bit, as shown in the sketch following this list.
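
Since this figure is also truncated, here is a sketch of flip_counter_and_wait() reconstructed from the description; the commented line numbers refer to the original figure, and the details should be checked against rcu_rcpls.c.

\begin{verbatim}
static void flip_counter_and_wait(int ctr)
{
  int i;
  int t;

  /* Line 6: advance the free-running counter. */
  ACCESS_ONCE(rcu_idx) = ctr + 1;
  /* Line 7: mask the counter down to its bottom bit. */
  i = ctr & 0x1;
  smp_mb();
  /* Wait for pre-existing readers using this index. */
  for_each_thread(t) {
    while (per_thread(rcu_refcnt, t)[i] != 0) {
      poll(NULL, 0, 10);
    }
  }
  smp_mb();
}
\end{verbatim}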

The changes to synchronize_rcu() are more pervasive:

  1. There is a new oldctr local variable that captures the pre-lock-acquisition value of rcu_idx on line 23.
  2. Line 26 uses ACCESS_ONCE() instead of atomic_read().
  3. Lines 27-30 check to see whether at least three counter flips were performed by other threads while the lock was being acquired, and, if so, release the lock, execute a memory barrier, and return. In this case, there were two full waits for the counters to go to zero, so those other threads have already done all the required work.
  4. At lines 33-34, flip_counter_and_wait() is invoked a second time only if there were fewer than two counter flips while the lock was being acquired. On the other hand, if there were two counter flips, some other thread did one full wait for all the counters to go to zero, so only one more is required. A sketch of the resulting synchronize_rcu() appears after this list.
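
The following sketch shows a synchronize_rcu() consistent with the above list; as with the earlier sketches, it is reconstructed from the prose and the figure's line references rather than copied verbatim from rcu_rcpls.c.

\begin{verbatim}
void synchronize_rcu(void)
{
  int ctr;
  int oldctr;

  smp_mb();
  /* Line 23: snapshot rcu_idx before acquiring the lock. */
  oldctr = ACCESS_ONCE(rcu_idx);
  smp_mb();
  spin_lock(&rcu_gp_lock);
  /* Line 26: re-read rcu_idx while holding the lock. */
  ctr = ACCESS_ONCE(rcu_idx);
  /* Lines 27-30: at least three flips while we were
     acquiring the lock mean that other threads have
     already done our work for us. */
  if (ctr - oldctr >= 3) {
    spin_unlock(&rcu_gp_lock);
    smp_mb();
    return;
  }
  flip_counter_and_wait(ctr);
  /* Lines 33-34: a second flip is needed only if fewer
     than two flips occurred while acquiring the lock. */
  if (ctr - oldctr < 2)
    flip_counter_and_wait(ctr + 1);
  spin_unlock(&rcu_gp_lock);
  smp_mb();
}
\end{verbatim}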

With this approach, if an arbitrarily large number of threads invoke synchronize_rcu() concurrently, with one CPU for each thread, there will be a total of only three waits for counters to go to zero.

Despite the improvements, this implementation of RCU still has a few shortcomings. First, as before, the need to flip rcu_idx twice imposes substantial overhead on updates, especially if there are large numbers of threads.

Second, each updater still acquires rcu_gp_lock, even if there is no work to be done. This can result in a severe scalability limitation if there are large numbers of concurrent updates. Section [*] shows one way to avoid this in a production-quality real-time implementation of RCU for the Linux kernel.

Third, this implementation requires per-thread variables and the ability to enumerate threads, which again can be problematic in some software environments.

Finally, on 32-bit machines, a given update thread might be preempted long enough for the rcu_idx counter to overflow. This could cause such a thread to force an unnecessary pair of counter flips. However, even if each grace period took only one microsecond, the offending thread would need to be preempted for more than an hour, since wrapping a 32-bit counter requires on the order of $2^{32}$ increments. In that case, an extra pair of counter flips is likely the least of your worries.

As with the implementation described in Section [*], the read-side primitives scale extremely well, incurring roughly 115 nanoseconds of overhead regardless of the number of CPUs. The synchronize_rcu() primitive is still expensive, ranging from about one microsecond up to about 16 microseconds. This is nevertheless much cheaper than the roughly 200 microseconds incurred by the implementation in Section [*]. So, despite its shortcomings, one could imagine this RCU implementation being used in production in real-life applications.

Quick Quiz 10.47: All of these toy RCU implementations have either atomic operations in rcu_read_lock() and rcu_read_unlock(), or synchronize_rcu() overhead that increases linearly with the number of threads. Under what circumstances could an RCU implementation enjoy light-weight implementations for all three of these primitives, all having deterministic ($O\left(1\right)$) overheads and latencies? End Quick Quiz

Referring back to Figure [*], we see that there is one global-variable access and no fewer than four accesses to thread-local variables. Given the relatively high cost of thread-local accesses on systems implementing POSIX threads, it is tempting to collapse the three thread-local variables into a single structure, permitting rcu_read_lock() and rcu_read_unlock() to access their thread-local data with a single thread-local-storage access. However, an even better approach would be to reduce the number of thread-local accesses to one, as is done in the next section.
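
For concreteness, the collapsed-structure approach mentioned above might look as follows; the structure and field names here are hypothetical illustrations, not taken from the original sources:

\begin{verbatim}
/* Hypothetical combined per-thread structure: a single
   thread-local-storage access reaches all three fields. */
struct rcu_reader_state {
  int nesting;    /* was rcu_nesting */
  int read_idx;   /* was rcu_read_idx */
  int refcnt[2];  /* was rcu_refcnt */
};
DEFINE_PER_THREAD(struct rcu_reader_state, rcu_reader_state);
\end{verbatim}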
