One way to retain update-side scalability while greatly improving read-side performance is to weaken consistency requirements. The counting algorithm in the previous section is guaranteed to return a value between the value that an ideal counter would have taken on near the beginning of read_count()'s execution and the value it would have taken on near the end of that execution. Eventual consistency [Vog09] provides a weaker guarantee: in the absence of calls to inc_count(), calls to read_count() will eventually return the correct answer.
We exploit eventual consistency by maintaining a global counter, but updaters manipulate only their per-thread counters. A separate thread transfers counts from the per-thread counters to the global counter, and readers simply access the value of the global counter. If updaters are active, the value used by readers will be out of date; however, once updates cease, the global counter will eventually converge on the true value, which is why this approach qualifies as eventually consistent.
The implementation is shown in Figure (count_stat_eventual.c).
Lines 1-2 show the per-thread variable and the global variable that track the counter's value, and line 3 shows stopflag, which is used to coordinate termination (for the case where we want to terminate the program with an accurate counter value).
The inc_count() function shown on lines 5-8 is identical to its counterpart in the previous section's figure.
The read_count() function shown on lines 10-13 simply returns the
value of the global_count variable.
However, the count_init() function on lines 34-42 creates the eventual() thread shown on lines 15-32, which cycles through all the threads, using the atomic_xchg() function to remove the count from each thread's local counter and then adding the sum to the global_count variable. The eventual() thread waits an arbitrarily chosen one millisecond between passes. The count_cleanup() function on lines 44-50 coordinates termination. A simplified sketch of this scheme appears below.
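Because the figure itself is not reproduced here, the following is a minimal, self-contained sketch of the scheme just described, not the book's count_stat_eventual.c. It uses C11 atomics and a plain array of per-thread counters in place of the DEFINE_PER_THREAD(), __get_thread_var(), for_each_thread(), and atomic_xchg() primitives, it assumes a per-thread my_id and an NR_THREADS limit, and it replaces the stopflag handshake with a single final drain pass.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <unistd.h>

    #define NR_THREADS 16                    /* assumed maximum number of updaters */

    static atomic_ulong counter[NR_THREADS]; /* per-thread counters */
    static atomic_ulong global_count;        /* eventually consistent value */
    static atomic_int stopflag;              /* coordinates termination */
    static pthread_t eventual_tid;

    static _Thread_local int my_id;          /* assumed to be set by each updater */

    static void inc_count(void)
    {
      /* Atomic, so eventual()'s exchange cannot lose a concurrent update. */
      atomic_fetch_add(&counter[my_id], 1);
    }

    static unsigned long read_count(void)
    {
      /* Readers pay only for a single load of the global counter. */
      return atomic_load(&global_count);
    }

    static void drain_counters(void)
    {
      unsigned long sum = 0;

      /* Remove each thread's count and fold the sum into global_count. */
      for (int t = 0; t < NR_THREADS; t++)
        sum += atomic_exchange(&counter[t], 0);
      atomic_fetch_add(&global_count, sum);
    }

    static void *eventual(void *arg)
    {
      (void)arg;
      while (!atomic_load(&stopflag)) {
        drain_counters();
        usleep(1000);          /* arbitrarily chosen one-millisecond pause */
      }
      drain_counters();        /* final pass so cleanup sees an accurate count */
      return NULL;
    }

    static void count_init(void)
    {
      pthread_create(&eventual_tid, NULL, eventual, NULL);
    }

    static void count_cleanup(void)
    {
      atomic_store(&stopflag, 1);
      pthread_join(eventual_tid, NULL);
    }

In this sketch, the final drain pass after stopflag is set stands in for the original's multi-step stopflag handshake: provided all updaters have stopped before count_cleanup() is called, global_count holds an accurate total once the eventual() thread has been joined.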
This approach gives extremely fast counter read-out while still supporting linear counter-update performance. However, this excellent read-side performance and update-side scalability come at the cost of high update-side overhead, due to both the atomic operations and the array indexing hidden in the __get_thread_var() primitive, which can be quite expensive on some CPUs with deep pipelines.
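As a purely illustrative usage of the sketch above (updater(), the loop counts, and the four-thread driver are assumptions, not part of the original figure), the following driver shows the asymmetry just described: each update performs one atomic add on a private per-thread counter, while the final read is a single load of global_count.

    #include <stdio.h>

    static void *updater(void *arg)
    {
      my_id = (int)(long)arg;  /* illustrative thread-id assignment */
      for (int i = 0; i < 1000000; i++)
        inc_count();           /* one atomic add per update */
      return NULL;
    }

    int main(void)
    {
      pthread_t tid[4];

      count_init();
      for (long t = 0; t < 4; t++)
        pthread_create(&tid[t], NULL, updater, (void *)t);
      for (int t = 0; t < 4; t++)
        pthread_join(tid[t], NULL);
      count_cleanup();         /* final drain before the last read */
      printf("final count: %lu\n", read_count());
      return 0;
    }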
Quick Quiz 6.16: Why does inc_count() in count_stat_eventual.c need to use atomic instructions?
End Quick Quiz
Quick Quiz 6.17: Won't the single global thread in the eventual() function of count_stat_eventual.c be just as severe a bottleneck as a global lock would be?
End Quick Quiz
Quick Quiz 6.18: Won't the estimate returned by read_count() in count_stat_eventual.c become increasingly inaccurate as the number of threads rises?
End Quick Quiz