RCU checks for stalled CPUs when the CONFIG_RCU_CPU_STALL_DETECTOR kernel parameter is selected. "Stalled CPUs" are those spinning in the kernel with preemption disabled, which degrades response time. These checks are implemented via the record_gp_stall_check_time(), check_cpu_stall(), print_cpu_stall(), and print_other_cpu_stall() functions, each of which is described below. All of these functions are no-ops when the CONFIG_RCU_CPU_STALL_DETECTOR kernel parameter is not selected.
Figure
shows the code for record_gp_stall_check_time().
Line 4 records the current time (of the start of the grace period)
in jiffies, and lines 5-6 record the time at which CPU stalls should
be checked for, should the grace period run on that long.
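The bookkeeping described above can be sketched in user-space C. This is a minimal illustration rather than the kernel's code: the structure, its field names, the tick rate, and the ten-second timeout are all assumptions chosen for the example, and the current time is passed in as a parameter instead of being read from the kernel's jiffies counter.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel state; not the kernel's definitions. */
#define SKETCH_HZ 1000UL                               /* assumed ticks per second */
#define SKETCH_TILL_STALL_CHECK (10UL * SKETCH_HZ)     /* assumed timeout */

struct sketch_rcu_state {
	unsigned long gp_start;      /* time at which the grace period started */
	unsigned long jiffies_stall; /* time at which to start checking for stalls */
};

/* Record the grace-period start time and the first stall-check time. */
static void sketch_record_gp_stall_check_time(struct sketch_rcu_state *rsp,
					      unsigned long now)
{
	assert(rsp);
	rsp->gp_start = now;
	rsp->jiffies_stall = now + SKETCH_TILL_STALL_CHECK;
}
```

With these assumed values, a grace period starting at time 5000 would become eligible for stall checking at time 15000.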
Figure
shows the code for check_cpu_stall(), which checks to see
if the grace period has stretched on too long, invoking either
print_cpu_stall() or print_other_cpu_stall() in order
to print a CPU-stall warning message if so.
Line 8 computes the number of jiffies since the time at which stall warnings should be printed, which will be negative if it is not yet time to print warnings. Line 9 obtains a pointer to the leaf rcu_node structure corresponding to the current CPU, and line 10 checks to see if the current CPU has not yet passed through a quiescent state and if the grace period has extended too long (in other words, if the current CPU is stalled), with line 11 invoking print_cpu_stall() if so.
Otherwise, lines 12-13 check to see if the grace period is still in effect and if it has extended a couple of jiffies past the CPU-stall warning duration, with line 14 invoking print_other_cpu_stall() if so.
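The decision just described can be sketched as a pure function. This is an illustration under stated assumptions, not the kernel's implementation: the function and enum names are hypothetical, the two-jiffy slack value is an assumed stand-in for the "couple of jiffies" in the text, and the per-CPU and grace-period state are reduced to two flags. The signed difference is what makes the delta negative before the warning time arrives.

```c
#include <assert.h>

#define SKETCH_STALL_RAT_DELAY 2L /* assumed slack before reporting other CPUs */

enum sketch_stall_action {
	SKETCH_NO_WARNING,
	SKETCH_WARN_SELF,   /* this CPU is the stalled one */
	SKETCH_WARN_OTHERS, /* some other CPU is holding up the grace period */
};

static enum sketch_stall_action
sketch_check_cpu_stall(unsigned long now, unsigned long jiffies_stall,
		       int self_needs_qs, int gp_in_progress)
{
	/*
	 * Unsigned subtraction followed by a signed cast: negative until
	 * the warning time arrives, and still meaningful if the counter
	 * wraps (assumes a two's-complement conversion, as on common
	 * platforms).
	 */
	long delta = (long)(now - jiffies_stall);

	if (self_needs_qs && delta >= 0)
		return SKETCH_WARN_SELF;
	if (gp_in_progress && delta >= SKETCH_STALL_RAT_DELAY)
		return SKETCH_WARN_OTHERS;
	return SKETCH_NO_WARNING;
}

/* Usage example: the boundary cases around a warning time of 200. */
static void sketch_check_cpu_stall_examples(void)
{
	assert(sketch_check_cpu_stall(199UL, 200UL, 1, 1) == SKETCH_NO_WARNING);
	assert(sketch_check_cpu_stall(200UL, 200UL, 1, 1) == SKETCH_WARN_SELF);
	assert(sketch_check_cpu_stall(201UL, 200UL, 0, 1) == SKETCH_NO_WARNING);
	assert(sketch_check_cpu_stall(202UL, 200UL, 0, 1) == SKETCH_WARN_OTHERS);
}
```

In this sketch a stalled CPU reports itself as soon as the warning time arrives, while reports about other CPUs wait out the extra slack first.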
Quick Quiz D.53:
Why wait the extra couple jiffies on lines 12-13 in
Figure ?
End Quick Quiz
Figure
shows the code for print_cpu_stall().
Lines 6-11 print a console message and dump the current CPU's stack, while lines 12-17 compute the time to the next CPU stall warning, should the grace period stretch on that much additional time.
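The two steps above (report, then schedule the next warning) might look as follows in user space. Everything here is an assumption made for illustration: the structure, the recheck interval, and the message text are hypothetical, the message is formatted into a caller-supplied buffer rather than printed to the console, and the stack dump and locking are omitted.

```c
#include <assert.h>
#include <stdio.h>

#define SKETCH_STALL_RECHECK 30000UL /* assumed recheck interval, in jiffies */

struct sketch_stall_state {
	unsigned long gp_start;      /* time at which the grace period started */
	unsigned long jiffies_stall; /* time of the next stall warning */
};

/*
 * Format a warning about the current CPU, then push the next warning
 * time out by the recheck interval.  Returns the number of characters
 * that snprintf() would have written.
 */
static int sketch_print_cpu_stall(struct sketch_stall_state *sp, int cpu,
				  unsigned long now, char *buf, size_t len)
{
	int n;

	assert(sp && buf);
	n = snprintf(buf, len, "stall on CPU %d (t=%lu jiffies)",
		     cpu, now - sp->gp_start);
	/* A kernel implementation would also dump the stack here. */
	sp->jiffies_stall = now + SKETCH_STALL_RECHECK;
	return n;
}
```

Re-arming the warning time in the same function keeps a long stall from producing a warning on every check while still repeating the report if the grace period drags on.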
Quick Quiz D.54:
What prevents the grace period from ending before the
stall warning is printed in
Figure ?
End Quick Quiz
Figure
shows the code for print_other_cpu_stall(), which prints out
stall warnings for CPUs other than the currently running CPU.
Lines 10 and 11 pick up references to the first leaf rcu_node structure and one past the last leaf rcu_node structure, respectively. Line 12 acquires the root rcu_node structure's lock, and also disables interrupts. Line 13 calculates how long ago the CPU-stall warning time occurred (which will be negative if it has not yet occurred), and lines 14 and 15 check to see if the CPU-stall warning time has not yet arrived or if the grace period has already ended, with line 16 releasing the lock (and re-enabling interrupts) and line 17 returning if so.
Quick Quiz D.55:
Why does print_other_cpu_stall() in
Figure
need to check for the grace period ending when
print_cpu_stall() did not?
End Quick Quiz
Otherwise, lines 19 and 20 compute the next time that CPU stall warnings should be printed (if the grace period extends that long) and line 21 releases the lock and re-enables interrupts. Lines 23-33 print a list of the stalled CPUs, and, finally, line 34 invokes force_quiescent_state() in order to nudge the offending CPUs into passing through a quiescent state.
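The scan that produces the list of stalled CPUs can be sketched as a walk over the leaf structures, from the first leaf up to (but not including) the one-past-the-end pointer. The structure below is a simplified, hypothetical stand-in for the leaf rcu_node structure: it assumes one bit per CPU that has not yet passed through a quiescent state, plus the lowest and highest CPU numbers covered by the leaf. The sketch collects CPU numbers into an array instead of printing them.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for a leaf rcu_node structure. */
struct sketch_leaf {
	unsigned long qsmask; /* bit i set: CPU grplo + i owes a quiescent state */
	int grplo;            /* lowest-numbered CPU covered by this leaf */
	int grphi;            /* highest-numbered CPU covered by this leaf */
};

/* Collect the CPUs whose bits remain set; returns how many were stored. */
static size_t sketch_collect_stalled(const struct sketch_leaf *cur,
				     const struct sketch_leaf *end,
				     int *out, size_t max)
{
	size_t n = 0;
	int i;

	assert(cur && end && out);
	for (; cur < end; cur++) {
		if (cur->qsmask == 0)
			continue; /* every CPU on this leaf has checked in */
		for (i = 0; i <= cur->grphi - cur->grplo; i++)
			if ((cur->qsmask & (1UL << i)) && n < max)
				out[n++] = cur->grplo + i;
	}
	return n;
}
```

For example, three leaves covering CPUs 0-3, 4-7, and 8-11 with masks 0x5, 0x0, and 0x2 would yield CPUs 0, 2, and 9, with the middle leaf skipped entirely.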
Paul E. McKenney 2011-12-16