Let's start with something simple, for example, the straightforward
use of arithmetic shown in
Figure (count_nonatomic.c).
Here, we have a counter on line 1, we increment it on line 5, and we
read out its value on line 10.
What could be simpler?
This approach has the additional advantage of being blazingly fast if you are doing lots of reading and almost no incrementing, and on small systems, the performance is excellent.
There is just one large fly in the ointment: this approach can lose counts. On my dual-core laptop, a short run invoked inc_count() 100,014,000 times, but the final value of the counter was only 52,909,118. Although it is true that approximate values have their place in computing, it is almost always necessary to do better than this.
Quick Quiz 6.6: But doesn't the ++ operator produce an x86 add-to-memory instruction? And won't the CPU cache cause this to be atomic? End Quick Quiz
Quick Quiz 6.7: The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection? End Quick Quiz
The straightforward way to count accurately is to use atomic operations,
as shown in
Figure (count_atomic.c).
Line 1 defines an atomic variable, line 5 atomically increments it, and
line 10 reads it out.
Because this is atomic, it keeps perfect count.
However, it is slower: on a Intel Core Duo laptop, it is about
six times slower than non-atomic increment
when a single thread is incrementing, and more than ten times
slower if two threads are incrementing.
This poor performance should not be a surprise, given the discussion in
Chapter ,
nor should it be a surprise that the performance of atomic increment
gets slower as the number of CPUs and threads increase, as shown in
Figure
.
In this figure, the horizontal dashed line resting on the x axis
is the ideal performance that would be achieved
by a perfectly scalable algorithm: with such an algorithm, a given
increment would incur the same overhead that it would in a single-threaded
program.
Atomic increment of a single global variable is clearly
decidedly non-ideal, and gets worse as you add CPUs.
Quick Quiz 6.8:
Why doesn't the dashed line on the x axis meet the
diagonal line at ?
End Quick Quiz
Quick Quiz 6.9: But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me, after all, most of the program's execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster? End Quick Quiz
For another perspective on global atomic increment, consider
Figure .
In order for each CPU to get a chance to increment a given
global variable, the cache line containing that variable must
circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in
the poor performance seen in
Figure
.
The following sections discuss high-performance counting, which avoids the delays inherent in such circulation.
Quick Quiz 6.10: But why can't CPU designers simply ship the operation to the data, avoiding the need to circulate the cache line containing the global variable being incremented? End Quick Quiz
Paul E. McKenney 2011-12-16