The rcu_nest.h and rcu_nest.c listings show an RCU implementation
that is again based on a single global free-running counter,
but that permits nesting of RCU read-side critical sections.
This nestability is accomplished by reserving the low-order bits of the
global rcu_gp_ctr to count nesting, using the definitions described
below.
This is a generalization of the scheme in the previous section,
which can be thought of as having a single low-order bit reserved
for counting nesting depth.
Two C-preprocessor macros are used to arrange this,
RCU_GP_CTR_NEST_MASK and RCU_GP_CTR_BOTTOM_BIT.
The two are related: RCU_GP_CTR_NEST_MASK = RCU_GP_CTR_BOTTOM_BIT - 1.
The RCU_GP_CTR_BOTTOM_BIT macro contains a single bit that is
positioned just above the bits reserved for counting nesting,
while RCU_GP_CTR_NEST_MASK has all bits set in the
region of rcu_gp_ctr used to count nesting.
Obviously, these two C-preprocessor macros must reserve enough
of the low-order bits of the counter to permit the maximum required
nesting of RCU read-side critical sections, and this implementation
reserves seven bits, for a maximum RCU read-side critical-section
nesting depth of 127, which should be well in excess of that needed
by most applications.
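Since the figure containing these definitions is not reproduced here,
the following minimal sketch shows one way they might look.
The use of GCC's __thread storage class and the mapping of smp_mb()
to the __sync_synchronize() built-in are assumptions standing in for
the per-thread-variable and memory-barrier wrappers used by the
actual rcu_nest.h:

  /* Sketch only: layout of the nestable free-running counter. */
  #define RCU_GP_CTR_SHIFT      7                           /* seven nesting bits */
  #define RCU_GP_CTR_BOTTOM_BIT (1L << RCU_GP_CTR_SHIFT)    /* 0x80 */
  #define RCU_GP_CTR_NEST_MASK  (RCU_GP_CTR_BOTTOM_BIT - 1) /* 0x7f */

  #define smp_mb() __sync_synchronize() /* full memory barrier (GCC built-in) */

  long rcu_gp_ctr = 0;         /* global free-running grace-period counter */
  __thread long rcu_reader_gp; /* per-thread snapshot plus nesting count */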
The resulting rcu_read_lock() implementation is still reasonably straightforward. Line 6 places a pointer to this thread's instance of rcu_reader_gp into the local variable rrgp, minimizing the number of expensive calls to the pthreads thread-local-state API. Line 7 records the current value of rcu_reader_gp into another local variable tmp, and line 8 checks to see if the low-order bits are zero, which would indicate that this is the outermost rcu_read_lock(). If so, line 9 places the global rcu_gp_ctr into tmp because the current value previously fetched by line 7 is likely to be obsolete. In either case, line 10 increments the nesting depth, which you will recall is stored in the seven low-order bits of the counter. Line 11 stores the updated counter back into this thread's instance of rcu_reader_gp, and, finally, line 12 executes a memory barrier to prevent the RCU read-side critical section from bleeding out into the code preceding the call to rcu_read_lock().
In other words, this implementation of rcu_read_lock() picks up a copy of the global rcu_gp_ctr unless the current invocation of rcu_read_lock() is nested within an RCU read-side critical section, in which case it instead fetches the contents of the current thread's instance of rcu_reader_gp. Either way, it increments whatever value it fetched in order to record an additional nesting level, and stores the result in the current thread's instance of rcu_reader_gp.
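Putting those steps together yields a sketch along the following lines.
It uses __thread rather than the pthreads thread-specific-data API, so
the pointer rrgp is retained only to mirror the description above, and
the line numbers in the walkthrough refer to the original figure, not
to this sketch:

  static void rcu_read_lock(void)
  {
    long tmp;
    long *rrgp;

    rrgp = &rcu_reader_gp;      /* cache pointer to per-thread counter */
    tmp = *rrgp;                /* snapshot plus nesting count */
    if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
      tmp = rcu_gp_ctr;         /* outermost: take a fresh snapshot */
    tmp++;                      /* record one more nesting level */
    *rrgp = tmp;
    smp_mb();                   /* contain the critical section that follows */
  }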
Interestingly enough, the implementation of rcu_read_unlock()
is identical to one shown in an earlier section.
Line 19 executes a memory barrier in order to prevent the RCU read-side
critical section from bleeding out into code following the call
to rcu_read_unlock(), and
line 20 decrements this thread's instance of rcu_reader_gp,
which has the effect of decrementing the nesting count contained in
rcu_reader_gp's low-order bits.
Debugging versions of this primitive would check (before decrementing!)
that these low-order bits were non-zero.
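In sketch form, under the same assumptions as before and with the
debug check shown as a comment:

  static void rcu_read_unlock(void)
  {
    smp_mb();        /* contain the critical section that precedes */
    /* Debug builds might: assert(rcu_reader_gp & RCU_GP_CTR_NEST_MASK); */
    rcu_reader_gp--; /* decrement nesting count in the low-order bits */
  }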
The implementation of synchronize_rcu() is quite similar to
that shown in the previous section.
There are two differences.
The first is that line 29 adds RCU_GP_CTR_BOTTOM_BIT
to the global rcu_gp_ctr instead of adding the constant "2",
and the second is that the comparison on line 32 has been abstracted
out to a separate function, where it checks the nesting bits indicated
by RCU_GP_CTR_NEST_MASK instead of unconditionally checking
the low-order bit.
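A sketch of the update side follows.
The reader_gp[] registry, MAX_THREADS, and nthreads are hypothetical
stand-ins for the book's per-thread-variable traversal primitives
(each reader thread is assumed to publish &rcu_reader_gp into the
registry at startup), and rcu_gp_ongoing() plays the role of the
abstracted comparison just described:

  #include <poll.h>
  #include <pthread.h>

  #define MAX_THREADS 64
  long *reader_gp[MAX_THREADS]; /* hypothetical per-thread registry */
  int nthreads;
  pthread_mutex_t rcu_gp_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Does this snapshot indicate an RCU read-side critical section? */
  static int rcu_gp_ongoing(long ctr)
  {
    return (ctr & RCU_GP_CTR_NEST_MASK) != 0;
  }

  void synchronize_rcu(void)
  {
    int t;

    smp_mb();
    pthread_mutex_lock(&rcu_gp_lock);
    rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT; /* advance grace-period number */
    smp_mb();
    for (t = 0; t < nthreads; t++) {
      /* Wait while thread t is in a critical section predating this GP. */
      while (rcu_gp_ongoing(*reader_gp[t]) &&
             *reader_gp[t] - rcu_gp_ctr < 0)
        poll(NULL, 0, 10); /* sleep 10 milliseconds between polls */
    }
    pthread_mutex_unlock(&rcu_gp_lock);
    smp_mb();
  }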
This approach achieves read-side performance almost equal to that
shown in the previous section, incurring
roughly 65 nanoseconds of overhead regardless of the number of
Power5 CPUs.
Updates again incur more overhead, ranging from about 600 nanoseconds on
a single Power5 CPU to more than 100 microseconds on 64
such CPUs.
Quick Quiz 10.52:
Why not simply maintain a separate per-thread nesting-level variable,
as was done in the previous section, rather than having all this
complicated bit manipulation?
End Quick Quiz
This implementation suffers from the same shortcomings as does that of
the previous section, except that
nesting of RCU read-side critical sections is now permitted.
In addition, on 32-bit systems, this approach shortens the time
required to overflow the global rcu_gp_ctr variable:
because each grace period now advances the counter by
RCU_GP_CTR_BOTTOM_BIT (that is, by 128) rather than by 2, a 32-bit
counter wraps after only about 2^25 grace periods rather than 2^31.
The following section shows one way to greatly increase the time
required for overflow to occur, while greatly reducing read-side
overhead.
Quick Quiz 10.53:
Given the algorithm shown above,
how could you double the time required to overflow the global
rcu_gp_ctr?
End Quick Quiz
Quick Quiz 10.54:
Again, given the algorithm shown above,
is counter overflow fatal?
Why or why not?
If it is fatal, what can be done to fix it?
End Quick Quiz