6.6 Parallel Counting Discussion

This chapter has presented the reliability, performance, and scalability problems with traditional counting primitives. The C-language ++ operator is not guaranteed to function reliably in multithreaded code, and atomic operations on a single variable neither perform nor scale well. This chapter has also presented a number of counting algorithms that perform and scale extremely well in certain special cases.


Table: Statistical Counter Performance on Power 5

                                            Reads
Algorithm        Section   Updates    1 Core    64 Cores
count_stat.c     [*]       40.4 ns    220 ns        220 ns
count_end.c      [*]        6.7 ns    521 ns    205,000 ns
count_end_rcu.c  [*]        6.7 ns    481 ns      3,700 ns


Table [*] shows the performance of the three parallel statistical counting algorithms. All three algorithms provide perfect linear scalability for updates. The per-thread-variable implementation (count_end.c) is significantly faster on updates than the array-based implementation (count_stat.c), but is slower on reads, and suffers severe lock contention when there are many parallel readers. This contention can be addressed using techniques introduced in Chapter [*], as shown on the last row of Table [*].

Quick Quiz 6.49: On the count_stat.c row of Table [*], we see that the update side scales linearly with the number of threads. How is that possible given that the more threads there are, the more per-thread counters must be summed up? End Quick Quiz

Quick Quiz 6.50: Even on the last row of Table [*], the read-side performance of these statistical counter implementations is pretty horrible. So why bother with them? End Quick Quiz


Table: Limit Counter Performance on Power 5

                                                        Reads
Algorithm           Section  Exact?  Updates    1 Core    64 Cores
count_lim.c         [*]      N        9.7 ns    517 ns    202,000 ns
count_lim_app.c     [*]      N        6.6 ns    520 ns    205,000 ns
count_lim_atomic.c  [*]      Y       56.1 ns    606 ns    166,000 ns
count_lim_sig.c     [*]      Y       17.5 ns    520 ns    205,000 ns


Table [*] shows the performance of the parallel limit-counting algorithms. Exact enforcement of the limits incurs a substantial performance penalty, although on the Power 5 system this penalty can be reduced by substituting read-side signals for update-side atomic operations. All of these implementations suffer from read-side lock contention in the face of concurrent readers.

Quick Quiz 6.51: Given the performance data shown in Table [*], we should always prefer read-side signals over update-side atomic operations, right? End Quick Quiz

Quick Quiz 6.52: Can advanced techniques be applied to address the lock contention for readers seen in Table [*]? End Quick Quiz

The fact that these algorithms only work well in their respective special cases might be considered a major problem with parallel programming in general. After all, the C-language ++ operator works just fine in single-threaded code, and not just for special cases, but in general, right?

This line of reasoning does contain a grain of truth, but is in essence misguided. The problem is not parallelism as such, but rather scalability. To understand this, first consider the C-language ++ operator. The fact is that it does not work in general, only for a restricted range of numbers. If you need to deal with 1,000-digit decimal numbers, the C-language ++ operator will not work for you.

Quick Quiz 6.53: The ++ operator works just fine for 1,000-digit numbers! Haven't you heard of operator overloading??? End Quick Quiz

This problem is not specific to arithmetic. Suppose you need to store and query data. Should you use an ASCII file, XML, a relational database, a linked list, a dense array, a B-tree, a radix tree, or any of the plethora of other data structures and environments that permit data to be stored and queried? It depends on what you need to do, how fast you need it done, and how large your data set is.

Similarly, if you need to count, your solution will depend on how large the numbers you must work with are, how many CPUs need to be manipulating a given number concurrently, how the number is to be used, and what level of performance and scalability you will need.

Nor is this problem specific to software. The design for a bridge meant to allow people to walk across a small brook might be as simple as a plank thrown across the brook. But this plank-based solution does not scale. You would probably not use a plank to span the kilometers-wide mouth of the Columbia River, nor would such a design be advisable for bridges carrying concrete trucks. In short, just as bridge design must change with increasing span and load, so must software design change as the number of CPUs increases.

The examples in this chapter have shown that an important tool permitting large numbers of CPUs to be brought to bear is partitioning, whether full partitioning, as in the statistical counters discussed in Section [*], or partial partitioning, as in the limit counters discussed in Sections [*] and [*]. Partitioning will be considered in far greater depth in the next chapter.

Quick Quiz 6.54: But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space? End Quick Quiz

Paul E. McKenney 2011-12-16