6.1 Why Isn't Concurrent Counting Trivial?

Figure: Just Count!
\begin{figure}{ \scriptsize
\begin{verbatim}1 long counter = 0;
2
3 void inc_count(void)
4 {
5   counter++;
6 }
7
8 long read_count(void)
9 {
10   return counter;
11 }\end{verbatim}
}\end{figure}

Let's start with something simple, for example, the straightforward use of arithmetic shown in Figure [*] (count_nonatomic.c). Here, we have a counter on line 1, we increment it on line 5, and we read out its value on line 10. What could be simpler?

This approach has the additional advantage of being blazingly fast if you are doing lots of reading and almost no incrementing, and on small systems, the performance is excellent.

There is just one large fly in the ointment: this approach can lose counts. On my dual-core laptop, a short run invoked inc_count() 100,014,000 times, but the final value of the counter was only 52,909,118. Although it is true that approximate values have their place in computing, it is almost always necessary to do better than this.
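The lost counts are easy to reproduce. The following is a minimal pthreads harness, a sketch rather than the actual test rig used above (the thread count, the iteration count, and the name test_run are arbitrary choices), that drives inc_count() from two threads and reports the shortfall:

{ \scriptsize
\begin{verbatim}
#include <pthread.h>
#include <stdio.h>

long counter = 0;

void inc_count(void)
{
  counter++;  /* Non-atomic load, add, and store. */
}

#define NTHREADS 2
#define NITERS 100000000L

void *test_run(void *arg)
{
  long i;

  for (i = 0; i < NITERS; i++)
    inc_count();
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  int t;

  for (t = 0; t < NTHREADS; t++)
    pthread_create(&tid[t], NULL, test_run, NULL);
  for (t = 0; t < NTHREADS; t++)
    pthread_join(tid[t], NULL);
  printf("expected %ld, got %ld\n",
         NTHREADS * NITERS, counter);
  return 0;
}
\end{verbatim}
}

Because each thread's counter++ compiles to a separate load, add, and store, increments from the two threads can overlap and overwrite each other, which is exactly how the counts above were lost.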

Quick Quiz 6.6: But doesn't the ++ operator produce an x86 add-to-memory instruction? And won't the CPU cache cause this to be atomic? End Quick Quiz

Quick Quiz 6.7: The 8-figure accuracy on the number of failures indicates that you really did test this. Why would it be necessary to test such a trivial program, especially when the bug is easily seen by inspection? End Quick Quiz

Figure: Just Count Atomically!
\begin{figure}{ \scriptsize
\begin{verbatim}1 atomic_t counter = ATOMIC_INIT(0);
2
3 void inc_count(void)
4 {
5   atomic_inc(&counter);
6 }
7
8 long read_count(void)
9 {
10   return atomic_read(&counter);
11 }\end{verbatim}
}\end{figure}

Figure: Atomic Increment Scalability on Nehalem
\resizebox{3in}{!}{\includegraphics{CodeSamples/count/atomic}}

The straightforward way to count accurately is to use atomic operations, as shown in Figure [*] (count_atomic.c). Line 1 defines an atomic variable, line 5 atomically increments it, and line 10 reads it out. Because the increment is atomic, this approach keeps a perfect count. However, it is slower: on an Intel Core Duo laptop, it is about six times slower than non-atomic increment when a single thread is incrementing, and more than ten times slower when two threads are incrementing.
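Note that atomic_t, ATOMIC_INIT(), atomic_inc(), and atomic_read() are Linux-kernel-style primitives. In user-space code, one way to get the same effect is with GCC's __atomic builtins, as in the following sketch (the choice of relaxed memory ordering is an assumption suited to a simple statistical counter, not something taken from count_atomic.c):

{ \scriptsize
\begin{verbatim}
long counter = 0;

void inc_count(void)
{
  /* Atomic read-modify-write: concurrent increments
     cannot overwrite each other, so no counts are lost. */
  __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
}

long read_count(void)
{
  return __atomic_load_n(&counter, __ATOMIC_RELAXED);
}
\end{verbatim}
}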

This poor performance should not be a surprise, given the discussion in Chapter [*], nor should it be a surprise that the performance of atomic increment gets slower as the number of CPUs and threads increases, as shown in Figure [*]. In this figure, the horizontal dashed line resting on the x axis is the ideal performance that would be achieved by a perfectly scalable algorithm: with such an algorithm, a given increment would incur the same overhead that it would in a single-threaded program. Atomic increment of a single global variable is decidedly non-ideal, and gets worse as you add CPUs.

Quick Quiz 6.8: Why doesn't the dashed line on the x axis meet the diagonal line at $y=1$? End Quick Quiz

Quick Quiz 6.9: But atomic increment is still pretty fast. And incrementing a single variable in a tight loop sounds pretty unrealistic to me; after all, most of the program's execution should be devoted to actually doing work, not accounting for the work it has done! Why should I care about making this go faster? End Quick Quiz

Figure: Data Flow For Global Atomic Increment
\resizebox{3in}{!}{\includegraphics{count/GlobalInc}}

For another perspective on global atomic increment, consider Figure [*]. In order for each CPU to get a chance to increment a given global variable, the cache line containing that variable must circulate among all the CPUs, as shown by the red arrows. Such circulation will take significant time, resulting in the poor performance seen in Figure [*].

The following sections discuss high-performance counting, which avoids the delays inherent in such circulation.

Quick Quiz 6.10: But why can't CPU designers simply ship the operation to the data, avoiding the need to circulate the cache line containing the global variable being incremented? End Quick Quiz
