D.2.4 Towards a More Scalable RCU Implementation

Figure: Hierarchical RCU State
\resizebox{3in}{!}{\includegraphics{appendix/rcuimpl/TreeClassicRCU}}

One effective way to reduce lock contention is to create a hierarchy, as shown in Figure [*]. Here, each of the four rcu_node structures has its own lock, so that only CPUs 0 and 1 will acquire the lower left rcu_node's lock, only CPUs 2 and 3 will acquire the lower middle rcu_node's lock, and only CPUs 4 and 5 will acquire the lower right rcu_node's lock. During any given grace period, only one of the CPUs accessing each of the lower rcu_node structures will access the upper rcu_node, namely, the last of each pair of CPUs to record a quiescent state for the corresponding grace period.

This results in a significant reduction in lock contention: instead of six CPUs contending for a single lock each grace period, we have only three for the upper rcu_node's lock (a reduction of 50%) and only two for each of the lower rcu_nodes' locks (a reduction of 67%).

Figure: Mapping rcu_node Hierarchy Into Array
\resizebox{3in}{!}{\includegraphics{appendix/rcuimpl/TreeMapping}}

The tree of rcu_node structures is embedded into a linear array in the rcu_state structure, with the root of the tree in element zero, as shown in Figure [*] for an eight-CPU system with a three-level hierarchy. Each arrow links a given rcu_node structure to its parent, representing the rcu_node's ->parent field. Each rcu_node indicates the range of CPUs covered, so that the root node covers all of the CPUs, each node in the second level covers half of the CPUs, and each node in the leaf level covering a pair of CPUs. This array is allocated statically at compile time based on the value of NR_CPUS.

Figure: Hierarchical RCU Grace Period
\resizebox{3in}{!}{\includegraphics{appendix/rcuimpl/TreeClassicRCUGP}}

The sequence of diagrams in Figure [*] shows how grace periods are detected. In the first figure, no CPU has yet passed through a quiescent state, as indicated by the red rectangles. Suppose that all six CPUs simultaneously try to tell RCU that they have passed through a quiescent state. Only one of each pair will be able to acquire the lock on the corresponding lower rcu_node, and so the second figure shows the result if the lucky CPUs are numbers 0, 3, and 5, as indicated by the green rectangles. Once these lucky CPUs have finished, then the other CPUs will acquire the lock, as shown in the third figure. Each of these CPUs will see that they are the last in their group, and therefore all three will attempt to move to the upper rcu_node. Only one at a time can acquire the upper rcu_node structure's lock, and the fourth, fifth, and sixth figures show the sequence of states assuming that CPU 1, CPU 2, and CPU 4 acquire the lock in that order. The sixth and final figure in the group shows that all CPUs have passed through a quiescent state, so that the grace period has ended.

Figure: Hierarchical RCU State 4,096 CPUs
\resizebox{3in}{!}{\includegraphics{appendix/rcuimpl/BigTreeClassicRCU}}

In the above sequence, there were never more than three CPUs contending for any one lock, in happy contrast to Classic RCU, where all six CPUs might contend. However, even more dramatic reductions in lock contention are possible with larger numbers of CPUs. Consider a hierarchy of rcu_node structures, with 64 lower structures and 64*64=4,096 CPUs, as shown in Figure [*].

Here each of the lower rcu_node structures' locks are acquired by 64 CPUs, a 64-times reduction from the 4,096 CPUs that would acquire Classic RCU's single global lock. Similarly, during a given grace period, only one CPU from each of the lower rcu_node structures will acquire the upper rcu_node structure's lock, which is again a 64x reduction from the contention level that would be experienced by Classic RCU running on a 4,096-CPU system.

Quick Quiz D.5: Wait a minute! With all those new locks, how do you avoid deadlock? End Quick Quiz

Quick Quiz D.6: Why stop at a 64-times reduction? Why not go for a few orders of magnitude instead? End Quick Quiz

Quick Quiz D.7: But I don't care about McKenney's lame excuses in the answer to Quick Quiz 2!!! I want to get the number of CPUs contending on a single lock down to something reasonable, like sixteen or so!!! End Quick Quiz

Figure: Hierarchical RCU State With BH
\resizebox{3in}{!}{\includegraphics{appendix/rcuimpl/BigTreeClassicRCUBH}}

The implementation maintains some per-CPU data, such as lists of RCU callbacks, organized into rcu_data structures. In addition, rcu (as in call_rcu()) and rcu_bh (as in call_rcu_bh()) each maintain their own hierarchy, as shown in Figure [*].

Quick Quiz D.8: OK, so what is the story with the colors? End Quick Quiz

The next section discusses energy conservation.

Paul E. McKenney 2011-12-16