10.3.4.9 RCU Based on Quiescent States

Figure: Data for Quiescent-State-Based RCU
\begin{figure}{ \scriptsize
\begin{verbatim}1 DEFINE_SPINLOCK(rcu_gp_lock);
...
...= 0;
3 DEFINE_PER_THREAD(long, rcu_reader_qs_gp);\end{verbatim}
}\end{figure}

Figure: Quiescent-State-Based RCU Read Side
\begin{figure}{ \scriptsize
\begin{verbatim}1 static void rcu_read_lock(void)...
...nline(void)
26 {
27 rcu_quiescent_state();
28 }\end{verbatim}
}\end{figure}

Figure [*] (rcu_qs.h) shows the read-side primitives used to construct a user-level implementation of RCU based on quiescent states, with the data shown in Figure [*]. As can be seen from lines 1-7 in the figure, the rcu_read_lock() and rcu_read_unlock() primitives do nothing, and can in fact be expected to be inlined and optimized away, as they are in server builds of the Linux kernel. This is due to the fact that quiescent-state-based RCU implementations approximate the extents of RCU read-side critical sections using the aforementioned quiescent states, which contains calls to rcu_quiescent_state(), shown from lines 9-15 in the figure. Threads entering extended quiescent states (for example, when blocking) may instead use the thread_offline() and thread_online() APIs to mark the beginning and the end, respectively, of such an extended quiescent state. As such, thread_online() is analogous to rcu_read_lock() and thread_offline() is analogous to rcu_read_unlock(). These two functions are shown on lines 17-28 in the figure. In either case, it is illegal for a quiescent state to appear within an RCU read-side critical section.

In rcu_quiescent_state(), line 11 executes a memory barrier to prevent any code prior to the quiescent state from being reordered into the quiescent state. Lines 12-13 pick up a copy of the global rcu_gp_ctr, using ACCESS_ONCE() to ensure that the compiler does not employ any optimizations that would result in rcu_gp_ctr being fetched more than once, and then adds one to the value fetched and stores it into the per-thread rcu_reader_qs_gp variable, so that any concurrent instance of synchronize_rcu() will see an odd-numbered value, thus becoming aware that a new RCU read-side critical section has started. Instances of synchronize_rcu() that are waiting on older RCU read-side critical sections will thus know to ignore this new one. Finally, line 14 executes a memory barrier.

Quick Quiz 10.55: Doesn't the additional memory barrier shown on line 14 of Figure [*], greatly increase the overhead of rcu_quiescent_state? End Quick Quiz

Some applications might use RCU only occasionally, but use it very heavily when they do use it. Such applications might choose to use rcu_thread_online() when starting to use RCU and rcu_thread_offline() when no longer using RCU. The time between a call to rcu_thread_offline() and a subsequent call to rcu_thread_online() is an extended quiescent state, so that RCU will not expect explicit quiescent states to be registered during this time.

The rcu_thread_offline() function simply sets the per-thread rcu_reader_qs_gp variable to the current value of rcu_gp_ctr, which has an even-numbered value. Any concurrent instances of synchronize_rcu() will thus know to ignore this thread.

Quick Quiz 10.56: Why are the two memory barriers on lines 19 and 22 of Figure [*] needed? End Quick Quiz

The rcu_thread_online() function simply invokes rcu_quiescent_state(), thus marking the end of the extended quiescent state.

Figure: RCU Update Side Using Quiescent States
\begin{figure}{ \scriptsize
\begin{verbatim}1 void synchronize_rcu(void)
2 {...
... 16 spin_unlock(&rcu_gp_lock);
17 smp_mb();
18 }\end{verbatim}
}\end{figure}

Figure [*] (rcu_qs.c) shows the implementation of synchronize_rcu(), which is quite similar to that of the preceding sections.

This implementation has blazingly fast read-side primitives, with an rcu_read_lock()-rcu_read_unlock() round trip incurring an overhead of roughly 50 picoseconds. The synchronize_rcu() overhead ranges from about 600 nanoseconds on a single-CPU Power5 system up to more than 100 microseconds on a 64-CPU system.

Quick Quiz 10.57: To be sure, the clock frequencies of ca-2008 Power systems were quite high, but even a 5GHz clock frequency is insufficient to allow loops to be executed in 50 picoseconds! What is going on here? End Quick Quiz

However, this implementation requires that each thread either invoke rcu_quiescent_state() periodically or to invoke rcu_thread_offline() for extended quiescent states. The need to invoke these functions periodically can make this implementation difficult to use in some situations, such as for certain types of library functions.

Quick Quiz 10.58: Why would the fact that the code is in a library make any difference for how easy it is to use the RCU implementation shown in Figures [*] and [*]? End Quick Quiz

Quick Quiz 10.59: But what if you hold a lock across a call to synchronize_rcu(), and then acquire that same lock within an RCU read-side critical section? This should be a deadlock, but how can a primitive that generates absolutely no code possibly participate in a deadlock cycle? End Quick Quiz

In addition, this implementation does not permit concurrent calls to synchronize_rcu() to share grace periods. That said, one could easily imagine a production-quality RCU implementation based on this version of RCU.

Paul E. McKenney 2011-12-16