D.2.8 Testing

RCU is fundamental synchronization code, so any failure of RCU results in random, difficult-to-debug memory corruption. It is therefore extremely important that RCU be highly reliable. Some of this reliability stems from careful design, but at the end of the day we must also rely on heavy stress testing, otherwise known as torture.

Fortunately, although there has been some debate as to exactly what populations are covered by the provisions of the Geneva Convention, it is still the case that it does not apply to software. Therefore, it is still legal to torture your software. In fact, it is strongly encouraged, because if you don't torture your software, it will end up torturing you by crashing at the most inconvenient times imaginable.

Therefore, we torture RCU quite vigorously using the rcutorture module.

However, it is not sufficient to torture the common-case uses of RCU. It is also necessary to torture it in unusual situations, for example, when concurrently onlining and offlining CPUs and when CPUs are concurrently entering and exiting dynticks idle mode. I use a script to online and offline CPUs, and I use the test_no_idle_hz module parameter to rcutorture to stress-test dynticks idle mode. Just to be fully paranoid, I sometimes run a kernbench workload in parallel as well. Ten hours of this sort of torture on a 128-way machine seems sufficient to shake out most bugs.
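The CPU-hotplug script itself is not reproduced here, so the following is only a minimal sketch of the idea, not the author's script. It assumes the standard Linux sysfs control file /sys/devices/system/cpu/cpuN/online; the function names, the random CPU selection, and the delay parameter are all hypothetical. It must run as root on a kernel built with CONFIG_HOTPLUG_CPU=y, and a real script would also need to handle write errors such as a busy CPU.

```python
import random
import time
from pathlib import Path

# Assumption: the standard Linux sysfs location for per-CPU hotplug control.
SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu, online, root=SYSFS_CPU_ROOT):
    """Write 1 (online) or 0 (offline) to cpu<N>/online.

    Returns False if the CPU has no control file, which is typical for
    CPUs that cannot be offlined (often cpu0).
    """
    control = Path(root) / f"cpu{cpu}" / "online"
    if not control.exists():
        return False
    control.write_text("1" if online else "0")
    return True


def hotplug_torture(cpus, rounds, delay=0.0, root=SYSFS_CPU_ROOT, rng=None):
    """Repeatedly offline and re-online randomly chosen CPUs.

    Returns the number of offline/online cycles actually performed.
    """
    rng = rng or random.Random()
    cycles = 0
    for _ in range(rounds):
        cpu = rng.choice(cpus)
        if set_cpu_online(cpu, False, root):
            time.sleep(delay)            # leave the CPU offline for a while
            set_cpu_online(cpu, True, root)
            cycles += 1
        time.sleep(delay)
    return cycles
```

Such a loop would be left running in the background while rcutorture (and, for the fully paranoid, kernbench) runs, for example `hotplug_torture(list(range(1, 8)), rounds=100000, delay=0.1)` on an eight-CPU system.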

Even this is not the complete story. As Alexey Dobriyan and Nick Piggin demonstrated in early 2008, it is also necessary to torture RCU with all relevant combinations of kernel parameters. The relevant kernel parameters may be identified using yet another script, and are as follows:

  1. CONFIG_CLASSIC_RCU: Classic RCU.
  2. CONFIG_PREEMPT_RCU: Preemptible (real-time) RCU.
  3. CONFIG_TREE_RCU: Classic RCU for huge SMP systems.
  4. CONFIG_RCU_FANOUT: Number of children for each rcu_node.
  5. CONFIG_RCU_FANOUT_EXACT: Balance the rcu_node tree.
  6. CONFIG_HOTPLUG_CPU: Allow CPUs to be offlined and onlined.
  7. CONFIG_NO_HZ: Enable dyntick-idle mode.
  8. CONFIG_SMP: Enable multi-CPU operation.
  9. CONFIG_RCU_CPU_STALL_DETECTOR: Enable RCU to detect when CPUs go on extended quiescent-state vacations.
  10. CONFIG_RCU_TRACE: Generate RCU trace files in debugfs.

We ignore the CONFIG_DEBUG_LOCK_ALLOC configuration variable under the perhaps-naive assumption that hierarchical RCU could not have broken lockdep. There are still 10 configuration variables, which would result in 1,024 combinations if they were independent boolean variables. Fortunately, the first three are mutually exclusive, which reduces the number of combinations to 384 (3 × 2^7). Unfortunately, CONFIG_RCU_FANOUT is not boolean: it can take on values from 2 to 64, increasing the number of combinations to 12,096 (3 × 2^6 × 63). This is an infeasible number of combinations.
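The counting argument above can be checked with a few lines of arithmetic. This is just a worked calculation; the variable names are illustrative and not taken from any script.

```python
# Ten configuration variables, all treated as independent booleans.
independent_booleans = 2 ** 10

# The three RCU flavors are mutually exclusive, so together they offer
# 3 choices rather than 2**3 = 8. The other seven variables stay boolean.
flavor_choices = 3
mutually_exclusive = flavor_choices * 2 ** 7

# CONFIG_RCU_FANOUT is not boolean: it ranges over 2..64 inclusive,
# which is 63 values. That leaves six true booleans.
fanout_values = len(range(2, 65))
with_fanout = flavor_choices * 2 ** 6 * fanout_values

print(independent_booleans, mutually_exclusive, with_fanout)
# prints: 1024 384 12096
```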

One key observation is that only CONFIG_NO_HZ and CONFIG_PREEMPT can be expected to have changed behavior if either CONFIG_CLASSIC_RCU or CONFIG_PREEMPT_RCU is in effect, as only these portions of the two pre-existing RCU implementations were changed during this effort. This cuts out almost two-thirds of the possible combinations.

Furthermore, not all of the possible values of CONFIG_RCU_FANOUT produce significantly different results. In fact, only a few cases really need to be tested separately:

  1. Single-node "tree".
  2. Two-level balanced tree.
  3. Three-level balanced tree.
  4. Autobalanced tree, where CONFIG_RCU_FANOUT specifies an unbalanced tree, but such that it is auto-balanced in the absence of CONFIG_RCU_FANOUT_EXACT.
  5. Unbalanced tree.
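To see why only a few fanout values matter, it can help to compute the tree shape that a given CPU count and fanout produce. The sketch below is a simplified model for illustration, not the kernel's rcu_node initialization code: it assumes the tree depth is the smallest number of levels whose capacity covers all CPUs, that an exact fanout fills leaves in order, and that autobalancing spreads CPUs evenly across the same number of leaves. The function names are hypothetical.

```python
def level_count(nr_cpus, fanout):
    """Smallest number of levels such that fanout**levels >= nr_cpus."""
    levels, capacity = 1, fanout
    while capacity < nr_cpus:
        capacity *= fanout
        levels += 1
    return levels


def leaf_sizes(nr_cpus, fanout, exact):
    """Number of CPUs under each leaf node in this simplified model."""
    if exact:
        # Fill each leaf to the full fanout; the last leaf takes the rest.
        sizes = [fanout] * (nr_cpus // fanout)
        if nr_cpus % fanout:
            sizes.append(nr_cpus % fanout)
        return sizes
    # Autobalance: same number of leaves, CPUs spread as evenly as possible.
    n_leaves = -(-nr_cpus // fanout)      # ceiling division
    per_leaf = -(-nr_cpus // n_leaves)
    sizes, remaining = [], nr_cpus
    for _ in range(n_leaves):
        take = min(per_leaf, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


for fanout in (8, 4, 2):
    print(fanout, level_count(8, fanout))   # prints: 8 1, then 4 2, then 2 3
print(leaf_sizes(8, 6, exact=False))        # prints: [4, 4]
print(leaf_sizes(8, 6, exact=True))         # prints: [6, 2]
```

Under this model, an eight-CPU system covers all five shapes: fanouts of 8, 4, and 2 give one, two, and three levels, while a fanout of 6 gives either two balanced leaves or a lopsided pair, depending on whether the exact fanout is enforced.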

Looking further, CONFIG_HOTPLUG_CPU makes sense only given CONFIG_SMP, and CONFIG_RCU_CPU_STALL_DETECTOR is independent, and really only needs to be tested once (though someone even more paranoid than am I might decide to test it both with and without CONFIG_SMP). Similarly, CONFIG_RCU_TRACE need only be tested once, but the truly paranoid (such as myself) will choose to run it both with and without CONFIG_NO_HZ.

This allows us to obtain excellent coverage of RCU with only 15 test cases. All test cases specify the following configuration parameters in order to run rcutorture and so that CONFIG_HOTPLUG_CPU=n actually takes effect:



CONFIG_RCU_TORTURE_TEST=m
CONFIG_MODULE_UNLOAD=y
CONFIG_SUSPEND=n
CONFIG_HIBERNATION=n


The 15 test cases are as follows:

  1. Force single-node "tree" for small systems:



    	CONFIG_NR_CPUS=8
    	CONFIG_RCU_FANOUT=8
    	CONFIG_RCU_FANOUT_EXACT=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  2. Force two-level tree for large systems:



    	CONFIG_NR_CPUS=8
    	CONFIG_RCU_FANOUT=4
    	CONFIG_RCU_FANOUT_EXACT=n
    	CONFIG_RCU_TRACE=n
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  3. Force three-level tree for huge systems:



    	CONFIG_NR_CPUS=8
    	CONFIG_RCU_FANOUT=2
    	CONFIG_RCU_FANOUT_EXACT=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  4. Test autobalancing to a balanced tree:



    	CONFIG_NR_CPUS=8
    	CONFIG_RCU_FANOUT=6
    	CONFIG_RCU_FANOUT_EXACT=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  5. Test unbalanced tree:



    	CONFIG_NR_CPUS=8
    	CONFIG_RCU_FANOUT=6
    	CONFIG_RCU_FANOUT_EXACT=y
    	CONFIG_RCU_CPU_STALL_DETECTOR=y
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  6. Disable CPU-stall detection:



    	CONFIG_SMP=y
    	CONFIG_NO_HZ=y
    	CONFIG_RCU_CPU_STALL_DETECTOR=n
    	CONFIG_HOTPLUG_CPU=y
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  7. Disable CPU-stall detection and dyntick idle mode:



    	CONFIG_SMP=y
    	CONFIG_NO_HZ=n
    	CONFIG_RCU_CPU_STALL_DETECTOR=n
    	CONFIG_HOTPLUG_CPU=y
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  8. Disable CPU-stall detection and CPU hotplug:



    	CONFIG_SMP=y
    	CONFIG_NO_HZ=y
    	CONFIG_RCU_CPU_STALL_DETECTOR=n
    	CONFIG_HOTPLUG_CPU=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  9. Disable CPU-stall detection, dyntick idle mode, and CPU hotplug:



    	CONFIG_SMP=y
    	CONFIG_NO_HZ=n
    	CONFIG_RCU_CPU_STALL_DETECTOR=n
    	CONFIG_HOTPLUG_CPU=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  10. Disable SMP, CPU-stall detection, dyntick idle mode, and CPU hotplug:



    	CONFIG_SMP=n
    	CONFIG_NO_HZ=n
    	CONFIG_RCU_CPU_STALL_DETECTOR=n
    	CONFIG_HOTPLUG_CPU=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


    This combination located a number of compiler warnings.

  11. Disable SMP and CPU hotplug:



    	CONFIG_SMP=n
    	CONFIG_NO_HZ=y
    	CONFIG_RCU_CPU_STALL_DETECTOR=y
    	CONFIG_HOTPLUG_CPU=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=y
    


  12. Test Classic RCU with dynticks idle but without preemption:



    	CONFIG_NO_HZ=y
    	CONFIG_PREEMPT=n
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=y
    	CONFIG_TREE_RCU=n
    


  13. Test Classic RCU with preemption but without dynticks idle:



    	CONFIG_NO_HZ=n
    	CONFIG_PREEMPT=y
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=n
    	CONFIG_CLASSIC_RCU=y
    	CONFIG_TREE_RCU=n
    


  14. Test Preemptible RCU with dynticks idle:



    	CONFIG_NO_HZ=y
    	CONFIG_PREEMPT=y
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=y
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=n
    


  15. Test Preemptible RCU without dynticks idle:



    	CONFIG_NO_HZ=n
    	CONFIG_PREEMPT=y
    	CONFIG_RCU_TRACE=y
    	CONFIG_PREEMPT_RCU=y
    	CONFIG_CLASSIC_RCU=n
    	CONFIG_TREE_RCU=n
    


For a large change that affects RCU core code, one should run rcutorture for each of the above combinations, concurrently with CPU offlining and onlining for those cases that enable CONFIG_HOTPLUG_CPU. For small changes, it may suffice to run kernbench in each case. Of course, if the change is confined to a particular subset of the configuration parameters, it may be possible to reduce the number of test cases.
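The bookkeeping for such a run can be sketched as follows. This is an illustrative assumption about how one might organize the cases, not an existing test harness: the dictionary layout and function names are hypothetical, only test cases 6 and 8 are shown, and the output uses the same NAME=value notation as the listings above.

```python
# Parameters common to all test cases, copied from the listing above.
COMMON = {
    "CONFIG_RCU_TORTURE_TEST": "m",
    "CONFIG_MODULE_UNLOAD": "y",
    "CONFIG_SUSPEND": "n",
    "CONFIG_HIBERNATION": "n",
}

# Test cases 6 and 8, copied from the list above; others follow the same form.
CASES = {
    6: {
        "CONFIG_SMP": "y",
        "CONFIG_NO_HZ": "y",
        "CONFIG_RCU_CPU_STALL_DETECTOR": "n",
        "CONFIG_HOTPLUG_CPU": "y",
        "CONFIG_RCU_TRACE": "y",
        "CONFIG_PREEMPT_RCU": "n",
        "CONFIG_CLASSIC_RCU": "n",
        "CONFIG_TREE_RCU": "y",
    },
    8: {
        "CONFIG_SMP": "y",
        "CONFIG_NO_HZ": "y",
        "CONFIG_RCU_CPU_STALL_DETECTOR": "n",
        "CONFIG_HOTPLUG_CPU": "n",
        "CONFIG_RCU_TRACE": "y",
        "CONFIG_PREEMPT_RCU": "n",
        "CONFIG_CLASSIC_RCU": "n",
        "CONFIG_TREE_RCU": "y",
    },
}


def config_fragment(case):
    """Merge the common parameters with one case's parameters."""
    merged = {**COMMON, **CASES[case]}
    return "".join(f"{name}={value}\n" for name, value in merged.items())


def needs_hotplug_torture(case):
    """CPU offlining and onlining applies only when hotplug is enabled."""
    return CASES[case].get("CONFIG_HOTPLUG_CPU") == "y"


for case in sorted(CASES):
    print(case, needs_hotplug_torture(case))   # prints: 6 True, then 8 False
```

Keeping the cases in one table makes it easy to drop those that a narrowly scoped change cannot affect.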

Torturing software: the Geneva Convention does not (yet) prohibit it, and I strongly recommend it!

Paul E. McKenney 2011-12-16