Although liveness can be difficult to prove, there is a simple trick that applies here. The first step is to make dyntick_nohz() indicate that it is done via a dyntick_nohz_done variable, as shown on line 27 of the following:
 1 proctype dyntick_nohz()
 2 {
 3   byte tmp;
 4   byte i = 0;
 5   bit old_gp_idle;
 6
 7   do
 8   :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
 9   :: i < MAX_DYNTICK_LOOP_NOHZ ->
10     tmp = dynticks_progress_counter;
11     atomic {
12       dynticks_progress_counter = tmp + 1;
13       old_gp_idle = (grace_period_state == GP_IDLE);
14       assert((dynticks_progress_counter & 1) == 1);
15     }
16     atomic {
17       tmp = dynticks_progress_counter;
18       assert(!old_gp_idle ||
19              grace_period_state != GP_DONE);
20     }
21     atomic {
22       dynticks_progress_counter = tmp + 1;
23       assert((dynticks_progress_counter & 1) == 0);
24     }
25     i++;
26   od;
27   dyntick_nohz_done = 1;
28 }
With this variable in place, we can add assertions to grace_period() to check for unnecessary blockage as follows:
 1 proctype grace_period()
 2 {
 3   byte curr;
 4   byte snap;
 5   bit shouldexit;
 6
 7   grace_period_state = GP_IDLE;
 8   atomic {
 9     printf("MDLN = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10     shouldexit = 0;
11     snap = dynticks_progress_counter;
12     grace_period_state = GP_WAITING;
13   }
14   do
15   :: 1 ->
16     atomic {
17       assert(!shouldexit);
18       shouldexit = dyntick_nohz_done;
19       curr = dynticks_progress_counter;
20       if
21       :: (curr == snap) && ((curr & 1) == 0) ->
22         break;
23       :: (curr - snap) > 2 || (snap & 1) == 0 ->
24         break;
25       :: else -> skip;
26       fi;
27     }
28   od;
29   grace_period_state = GP_DONE;
30   grace_period_state = GP_IDLE;
31   atomic {
32     shouldexit = 0;
33     snap = dynticks_progress_counter;
34     grace_period_state = GP_WAITING;
35   }
36   do
37   :: 1 ->
38     atomic {
39       assert(!shouldexit);
40       shouldexit = dyntick_nohz_done;
41       curr = dynticks_progress_counter;
42       if
43       :: (curr == snap) && ((curr & 1) == 0) ->
44         break;
45       :: (curr != snap) ->
46         break;
47       :: else -> skip;
48       fi;
49     }
50   od;
51   grace_period_state = GP_DONE;
52 }
We have added the shouldexit variable on line 5, which we initialize to zero on line 10. Line 17 asserts that shouldexit is not set, while line 18 sets shouldexit to the dyntick_nohz_done variable maintained by dyntick_nohz(). This assertion will therefore trigger if we attempt to take more than one pass through the wait-for-counter-flip-acknowledgement loop after dyntick_nohz() has completed execution. After all, if dyntick_nohz() is done, then there cannot be any more state changes to force us out of the loop, so going through twice in this state means an infinite loop, which in turn means no end to the grace period.
Lines 32, 39, and 40 operate in a similar manner for the second (memory-barrier) loop.
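The shouldexit trick can also be illustrated outside of Promela. The following is a minimal single-threaded C sketch, not part of the model: wait_loop(), exit_if_changed(), exit_never(), and the max_passes bound are hypothetical names introduced here for illustration. Instead of asserting, the sketch returns -1 at the point where the model's assert(!shouldexit) would fire.

```c
#include <stdbool.h>

/* Hypothetical exit tests, for illustration only (not from the model). */
typedef bool (*exit_test_t)(int curr, int snap);

bool exit_if_changed(int curr, int snap) { return curr != snap; }
bool exit_never(int curr, int snap) { (void)curr; (void)snap; return false; }

/*
 * Poll *counter until may_exit() allows the loop to end.
 * *other_done plays the role of dyntick_nohz_done: once it is set,
 * *counter can no longer change.
 *
 * Returns the number of passes taken on a normal exit, -1 where the
 * model's assert(!shouldexit) would have triggered, or 0 if max_passes
 * (an assumption added so this sketch always terminates) is exhausted.
 */
int wait_loop(const int *counter, const bool *other_done, int snap,
              exit_test_t may_exit, int max_passes)
{
    bool shouldexit = false;
    int passes = 0;

    while (passes < max_passes) {
        if (shouldexit)
            return -1;            /* a second pass after the other side finished */
        shouldexit = *other_done; /* record what was seen on this pass */
        passes++;
        if (may_exit(*counter, snap))
            return passes;
    }
    return 0;
}
```

With the counter frozen and *other_done already set, an exit test that can succeed leaves the loop on the first pass, while one that cannot succeed is flagged at the start of the second pass, which is exactly the situation the assertion on line 17 is meant to catch.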
However, running this model (dyntickRCU-base-sl-busted.spin) results in failure, because line 23 checks the wrong variable for evenness. Upon failure, spin writes out a "trail" file (dyntickRCU-base-sl-busted.spin.trail), which records the sequence of states that leads to the failure. Use the spin -t -p -g -l dyntickRCU-base-sl-busted.spin command to cause spin to retrace this sequence of states, printing the statements executed and the values of variables (dyntickRCU-base-sl-busted.spin.trail.txt). Note that the line numbers do not match the listing above because spin takes both processes from a single file. However, the line numbers do match the full model (dyntickRCU-base-sl-busted.spin).
We see that the dyntick_nohz() process completed at step 34 (search for "34:"), but that the grace_period() process nonetheless failed to exit the loop. The value of curr is 6 (see step 35) and the value of snap is 5 (see step 17). Therefore the first condition on line 21 above does not hold because curr != snap, and the second condition on line 23 does not hold either because snap is odd and curr is only one greater than snap.
So one of these two conditions has to be incorrect. Referring to the comment block in rcu_try_flip_waitack_needed() for the first condition:
If the CPU remained in dynticks mode for the entire time and didn't take any interrupts, NMIs, SMIs, or whatever, then it cannot be in the middle of an rcu_read_lock(), so the next rcu_read_lock() it executes must use the new value of the counter. So we can safely pretend that this CPU already acknowledged the counter.
The first condition does match this, because if curr == snap and if curr is even, then the corresponding CPU has been in dynticks-idle mode the entire time, as required. So let's look at the comment block for the second condition:
If the CPU passed through or entered a dynticks idle phase with no active irq handlers, then, as above, we can safely pretend that this CPU already acknowledged the counter.
The first part of the condition is correct, because if curr and snap differ by two, there will be at least one even number in between, corresponding to having passed completely through a dynticks-idle phase. However, the second part of the condition corresponds to having started in dynticks-idle mode, not having finished in this mode. We therefore need to be testing curr rather than snap for being an even number.
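To make the difference concrete, the two variants of the exit test can be evaluated on the values from the trail (curr = 6, snap = 5). This is a small C sketch rather than the model itself; the function names ack_ok_buggy() and ack_ok_fixed() are made up for illustration, and plain int stands in for the model's byte variables.

```c
#include <stdbool.h>

/* Exit test as modeled on lines 21-24 of grace_period(): tests snap for evenness. */
bool ack_ok_buggy(int curr, int snap)
{
    if (curr == snap && (curr & 1) == 0)
        return true;
    return (curr - snap) > 2 || (snap & 1) == 0;
}

/* Same test, except that the second condition tests curr for evenness. */
bool ack_ok_fixed(int curr, int snap)
{
    if (curr == snap && (curr & 1) == 0)
        return true;
    return (curr - snap) > 2 || (curr & 1) == 0;
}
```

For curr = 6 and snap = 5, ack_ok_buggy() returns false, so the loop keeps spinning, whereas ack_ok_fixed() returns true because curr is even, meaning that the CPU finished in dynticks-idle mode.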
The corrected C code is as follows:
 1 static inline int
 2 rcu_try_flip_waitack_needed(int cpu)
 3 {
 4   long curr;
 5   long snap;
 6
 7   curr = per_cpu(dynticks_progress_counter, cpu);
 8   snap = per_cpu(rcu_dyntick_snapshot, cpu);
 9   smp_mb();
10   if ((curr == snap) && ((curr & 0x1) == 0))
11     return 0;
12   if ((curr - snap) > 2 || (curr & 0x1) == 0)
13     return 0;
14   return 1;
15 }
Lines 10-13 can now be combined and simplified, resulting in the following. A similar simplification can be applied to rcu_try_flip_waitmb_needed().
static inline int
rcu_try_flip_waitack_needed(int cpu)
{
  long curr;
  long snap;

  curr = per_cpu(dynticks_progress_counter, cpu);
  snap = per_cpu(rcu_dyntick_snapshot, cpu);
  smp_mb();
  if ((curr - snap) >= 2 || (curr & 0x1) == 0)
    return 0;
  return 1;
}
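One way to see why the two tests can be merged is that the first test requires curr to be even, which the second test already accepts on its own. The brute-force C sketch below checks this over a small range of values. It takes curr and snap as plain arguments instead of reading per-CPU variables, and the names needed_two_tests(), needed_one_test(), and count_disagreements() are hypothetical. It keeps the > 2 comparison so that only the merge itself is tested; the simplified listing's change to >= 2 additionally accepts a difference of exactly two, in line with the "differ by two" reasoning given earlier.

```c
/* Return values follow the listings: 0 = no need to wait, 1 = must wait. */

/* The corrected function with its two separate tests. */
int needed_two_tests(long curr, long snap)
{
    if ((curr == snap) && ((curr & 0x1) == 0))
        return 0;
    if ((curr - snap) > 2 || (curr & 0x1) == 0)
        return 0;
    return 1;
}

/* Only the second test: the first is subsumed by "curr is even". */
int needed_one_test(long curr, long snap)
{
    if ((curr - snap) > 2 || (curr & 0x1) == 0)
        return 0;
    return 1;
}

/* Count pairs with 0 <= snap <= curr < limit on which the two versions disagree. */
int count_disagreements(long limit)
{
    int n = 0;

    for (long snap = 0; snap < limit; snap++)
        for (long curr = snap; curr < limit; curr++)
            if (needed_two_tests(curr, snap) != needed_one_test(curr, snap))
                n++;
    return n;
}
```

Calling count_disagreements() with any positive limit returns zero, since every input that satisfies the first test also satisfies the second.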
Making the corresponding correction in the model (dyntickRCU-base-sl.spin) results in a verification that passes without errors, covering 661 states. However, it is worth noting that the first version of the liveness verification failed to catch this bug, due to a bug in the liveness verification itself. This liveness-verification bug was located by inserting an infinite loop in the grace_period() process, and noting that the liveness-verification code failed to detect this problem!
We have now successfully verified both safety and liveness conditions, but only for processes running and blocking. We also need to handle interrupts, a task taken up in the next section.
Paul E. McKenney 2011-12-16