One key attribute of RCU is the ability to safely scan data, even
though that data is being modified concurrently.
To provide this ability for concurrent insertion,
RCU uses what can be thought of as a publish-subscribe mechanism.
For example, consider an initially NULL global pointer
gp that is to be modified to point to a newly allocated
and initialized data structure.
The code fragment shown in
Figure
(with the addition of appropriate locking)
might be used for this purpose.
Unfortunately, there is nothing forcing the compiler and CPU to execute the last four assignment statements in order. If the assignment to gp happens before the initialization of p fields, then concurrent readers could see the uninitialized values. Memory barriers are required to keep things ordered, but memory barriers are notoriously difficult to use. We therefore encapsulate them into a primitive rcu_assign_pointer() that has publication semantics. The last four lines would then be as follows:
1 p->a = 1; 2 p->b = 2; 3 p->c = 3; 4 rcu_assign_pointer(gp, p); |
The rcu_assign_pointer() would publish the new structure, forcing both the compiler and the CPU to execute the assignment to gp after the assignments to the fields referenced by p
However, it is not sufficient to only enforce ordering at the updater, as the reader must enforce proper ordering as well. Consider for example the following code fragment:
1 p = gp; 2 if (p != NULL) { 3 do_something_with(p->a, p->b, p->c); 4 } |
Although this code fragment might well seem immune to misordering, unfortunately, the DEC Alpha CPU [McK05a,McK05b] and value-speculation compiler optimizations can, believe it or not, cause the values of p->a, p->b, and p->c to be fetched before the value of p. This is perhaps easiest to see in the case of value-speculation compiler optimizations, where the compiler guesses the value of p fetches p->a, p->b, and p->c then fetches the actual value of p in order to check whether its guess was correct. This sort of optimization is quite aggressive, perhaps insanely so, but does actually occur in the context of profile-driven optimization.
Clearly, we need to prevent this sort of skullduggery on the part of both the compiler and the CPU. The rcu_dereference() primitive uses whatever memory-barrier instructions and compiler directives are required for this purpose:
1 rcu_read_lock(); 2 p = rcu_dereference(gp); 3 if (p != NULL) { 4 do_something_with(p->a, p->b, p->c); 5 } 6 rcu_read_unlock(); |
The rcu_dereference() primitive can thus be thought of
as subscribing to a given value of the specified pointer,
guaranteeing that subsequent dereference operations will see any
initialization that occurred before the corresponding publish
(rcu_assign_pointer() operation.
The rcu_read_lock() and rcu_read_unlock()
calls are absolutely required: they define the extent of the
RCU read-side critical section.
Their purpose is explained in
Section ,
however, they never spin or block, nor do they prevent the
list_add_rcu() from executing concurrently.
In fact, in non-CONFIG_PREEMPT kernels, they generate
absolutely no code.
Although rcu_assign_pointer() and
rcu_dereference() can in theory be used to construct any
conceivable RCU-protected data structure, in practice it is often better
to use higher-level constructs.
Therefore, the rcu_assign_pointer() and
rcu_dereference()
primitives have been embedded in special RCU variants of Linux's
list-manipulation API.
Linux has two variants of doubly linked list, the circular
struct list_head and the linear
struct hlist_head/struct hlist_node pair.
The former is laid out as shown in
Figure ,
where the green boxes represent
the list header and the blue boxes represent the elements in the
list.
This notation is cumbersome, and will therefore be abbreviated as shown in
Figure
.
Adapting the pointer-publish example for the linked list results in
the code shown in
Figure .
Line 15 must be protected by some synchronization mechanism (most commonly some sort of lock) to prevent multiple list_add() instances from executing concurrently. However, such synchronization does not prevent this list_add() instance from executing concurrently with RCU readers.
Subscribing to an RCU-protected list is straightforward:
1 rcu_read_lock(); 2 list_for_each_entry_rcu(p, head, list) { 3 do_something_with(p->a, p->b, p->c); 4 } 5 rcu_read_unlock(); |
The list_add_rcu() primitive publishes an entry into the specified list, guaranteeing that the corresponding list_for_each_entry_rcu() invocation will properly subscribe to this same entry.
Quick Quiz 10.7: What prevents the list_for_each_entry_rcu() from getting a segfault if it happens to execute at exactly the same time as the list_add_rcu()? End Quick Quiz
Linux's other doubly linked list, the hlist,
is a linear list, which means that
it needs only one pointer for the header rather than the two
required for the circular list, as shown in
Figure.
Thus, use of hlist can halve the memory consumption for the hash-bucket
arrays of large hash tables.
As before, this notation is cumbersome, so hlists will be abbreviated
in the same way lists are, as shown in
Figure
.
Publishing a new element to an RCU-protected hlist is quite similar
to doing so for the circular list,
as shown in Figure .
As before, line 15 must be protected by some sort of synchronization mechanism, for example, a lock.
Subscribing to an RCU-protected hlist is also similar to the circular list:
1 rcu_read_lock(); 2 hlist_for_each_entry_rcu(p, q, head, list) { 3 do_something_with(p->a, p->b, p->c); 4 } 5 rcu_read_unlock(); |
Quick Quiz 10.8: Why do we need to pass two pointers into hlist_for_each_entry_rcu() when only one is needed for list_for_each_entry_rcu()? End Quick Quiz
The set of RCU publish and subscribe primitives are shown in
Table ,
along with additional primitives to ``unpublish'', or retract.
Note that the list_replace_rcu(), list_del_rcu(), hlist_replace_rcu(), and hlist_del_rcu() APIs add a complication. When is it safe to free up the data element that was replaced or removed? In particular, how can we possibly know when all the readers have released their references to that data element?
These questions are addressed in the following section.
Paul E. McKenney 2011-12-16