10.3.4.7 RCU Based on Free-Running Counter

Figure: Data for Free-Running Counter Using RCU
\begin{figure}{ \scriptsize
\begin{verbatim}  1 DEFINE_SPINLOCK(rcu_gp_lock);
  2 long rcu_gp_ctr = 0;
  3 DEFINE_PER_THREAD(long, rcu_reader_gp);
  4 DEFINE_PER_THREAD(long, rcu_reader_gp_snap);\end{verbatim}
}\end{figure}

Figure: Free-Running Counter Using RCU
\begin{figure}{ \scriptsize
\begin{verbatim}  1 static void rcu_read_lock(void)
    ...
 28   spin_unlock(&rcu_gp_lock);
 29   smp_mb();
 30 }\end{verbatim}
}\end{figure}

Figure [*] (rcu.h and rcu.c) shows an RCU implementation based on a single global free-running counter that takes on only even-numbered values, with data shown in Figure [*]. The resulting rcu_read_lock() implementation is extremely straightforward. Line 3 simply adds one to the global free-running rcu_gp_ctr variable and stores the resulting odd-numbered value into the rcu_reader_gp per-thread variable. Line 4 executes a memory barrier to prevent the contents of the subsequent RCU read-side critical section from ``leaking out''.
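For concreteness, here is a minimal sketch of rcu_read_lock() consistent with this description, assuming the DEFINE_PER_THREAD()/__get_thread_var() per-thread-variable macros and the smp_mb() primitive used by the other implementations in this section; the comments give the corresponding figure line numbers.

\begin{verbatim}
static void rcu_read_lock(void)
{
  /* Line 3: publish an odd value derived from the (always-even)
   * global counter, marking this thread as an active reader. */
  __get_thread_var(rcu_reader_gp) = rcu_gp_ctr + 1;

  /* Line 4: prevent the critical section's contents from being
   * reordered to precede the store above. */
  smp_mb();
}
\end{verbatim}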

The rcu_read_unlock() implementation is similar. Line 9 executes a memory barrier, again to prevent the prior RCU read-side critical section from ``leaking out''. Line 10 then copies the rcu_gp_ctr global variable to the rcu_reader_gp per-thread variable, leaving this per-thread variable with an even-numbered value so that a concurrent instance of synchronize_rcu() will know to ignore it.
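A corresponding sketch of rcu_read_unlock(), under the same assumptions:

\begin{verbatim}
static void rcu_read_unlock(void)
{
  /* Line 9: prevent the critical section's contents from being
   * reordered to follow the store below. */
  smp_mb();

  /* Line 10: restore an even value so that a concurrent
   * synchronize_rcu() knows to ignore this thread. */
  __get_thread_var(rcu_reader_gp) = rcu_gp_ctr;
}
\end{verbatim}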

Quick Quiz 10.48: If any even value is sufficient to tell synchronize_rcu() to ignore a given task, why doesn't line 10 of Figure [*] simply assign zero to rcu_reader_gp? End Quick Quiz

Thus, synchronize_rcu() could wait for all of the per-thread rcu_reader_gp variables to take on even-numbered values. However, it is possible to do much better than that because synchronize_rcu() need only wait on pre-existing RCU read-side critical sections. Line 17 executes a memory barrier to prevent prior manipulations of RCU-protected data structures from being reordered (by either the CPU or the compiler) to follow the increment on line 19. Line 18 acquires the rcu_gp_lock (and line 28 releases it) in order to prevent multiple synchronize_rcu() instances from running concurrently. Line 19 then increments the global rcu_gp_ctr variable by two, so that all pre-existing RCU read-side critical sections will have corresponding per-thread rcu_reader_gp variables with values less than that of rcu_gp_ctr, modulo the machine's word size. Recall also that threads with even-numbered values of rcu_reader_gp are not in an RCU read-side critical section, so that lines 21-27 scan the rcu_reader_gp values until they all are either even (line 22) or are greater than the global rcu_gp_ctr (lines 23-24). Line 25 blocks for a short period of time to wait for a pre-existing RCU read-side critical section to complete, but this can be replaced with a spin-loop if grace-period latency is of the essence. Finally, the memory barrier at line 29 ensures that any subsequent destruction will not be reordered into the preceding loop.
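The following sketch of synchronize_rcu() is consistent with this line-by-line description, again assuming the spin_lock(), for_each_thread(), and per_thread() primitives used elsewhere in this section. The poll() call corresponds to the short block on line 25, and would be replaced by a spin-loop if grace-period latency were of the essence.

\begin{verbatim}
void synchronize_rcu(void)
{
  int t;

  smp_mb();                   /* Line 17: order prior updates. */
  spin_lock(&rcu_gp_lock);    /* Line 18: one updater at a time. */
  rcu_gp_ctr += 2;            /* Line 19: counter stays even. */
  smp_mb();                   /* Enforce ordering of the counter update. */
  for_each_thread(t) {        /* Lines 21-27: wait for pre-existing readers. */
    while ((per_thread(rcu_reader_gp, t) & 0x1) &&
           ((per_thread(rcu_reader_gp, t) -
             rcu_gp_ctr) < 0)) {
      poll(NULL, 0, 10);      /* Line 25: block briefly. */
    }
  }
  spin_unlock(&rcu_gp_lock);  /* Line 28. */
  smp_mb();                   /* Line 29: order subsequent destruction. */
}
\end{verbatim}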

Quick Quiz 10.49: Why are the memory barriers on lines 17 and 29 of Figure [*] needed? Aren't the memory barriers inherent in the locking primitives on lines 18 and 28 sufficient? End Quick Quiz

This approach achieves much better read-side performance, incurring roughly 63 nanoseconds of overhead regardless of the number of Power5 CPUs. Updates incur more overhead, ranging from about 500 nanoseconds on a single Power5 CPU to more than 100 microseconds on 64 such CPUs.

Quick Quiz 10.50: Couldn't the update-side optimization described in Section [*] be applied to the implementation shown in Figure [*]? End Quick Quiz

This implementation suffers from some serious shortcomings in addition to the high update-side overhead noted earlier. First, it is no longer permissible to nest RCU read-side critical sections, a topic that is taken up in the next section. Second, if a reader is preempted at line 3 of Figure [*] after fetching from rcu_gp_ctr but before storing to rcu_reader_gp, and if the rcu_gp_ctr counter then runs through more than half but less than all of its possible values, then synchronize_rcu() will ignore the subsequent RCU read-side critical section. Third and finally, this implementation requires that the enclosing software environment be able to enumerate threads and maintain per-thread variables.
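To see the first of these shortcomings concretely, consider the hypothetical nested usage sketched below, where do_something_with() and p stand in for arbitrary reader-side code and an RCU-protected pointer: the inner rcu_read_unlock() stores an even value into this thread's rcu_reader_gp, so a concurrent synchronize_rcu() will conclude that the thread is quiescent even though the outer read-side critical section is still in progress.

\begin{verbatim}
rcu_read_lock();       /* rcu_reader_gp now odd: reader visible.    */
rcu_read_lock();       /* Overwritten with another odd value.       */
rcu_read_unlock();     /* rcu_reader_gp now even: reader appears    */
                       /* idle, so synchronize_rcu() won't wait...  */
do_something_with(p);  /* ...and p might already have been freed.   */
rcu_read_unlock();
\end{verbatim}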

Quick Quiz 10.51: Is the possibility of readers being preempted in line 3 of Figure [*] a real problem? In other words, is there a real sequence of events that could lead to failure? If not, why not? If so, what is the sequence of events, and how can the failure be addressed? End Quick Quiz
