This chapter has presented the reliability, performance, and scalability problems with traditional counting primitives. The C-language ++ operator is not guaranteed to function reliably in multithreaded code, and atomic operations to a single variable neither perform nor scale well. This chapter has also presented a number of counting algorithms that perform and scale extremely well in certain special cases.
Table shows the performance of the three parallel statistical counting algorithms. All three algorithms provide perfect linear scalability for updates. The per-thread-variable implementation is significantly faster on updates than the array-based implementation, but is slower at reads, and suffers severe lock contention when there are many parallel readers. This contention can be addressed using techniques introduced in Chapter , as shown on the last row of Table .
Quick Quiz 6.49: On the count_stat.c row of Table , we see that the update side scales linearly with the number of threads. How is that possible, given that the more threads there are, the more per-thread counters must be summed up? End Quick Quiz
Quick Quiz 6.50: Even on the last row of Table , the read-side performance of these statistical counter implementations is pretty horrible. So why bother with them? End Quick Quiz
Figure shows the performance of the parallel limit-counting algorithms. Exact enforcement of the limits incurs a substantial performance penalty, although on the Power 5 system this penalty can be reduced by substituting read-side signals for update-side atomic operations. All of these implementations suffer from read-side lock contention in the face of concurrent readers.
Quick Quiz 6.51: Given the performance data shown in Table , we should always prefer read-side signals over update-side atomic operations, right? End Quick Quiz
Quick Quiz 6.52: Can advanced techniques be applied to address the read-side lock contention seen in Table ? End Quick Quiz
The fact that these algorithms only work well in their respective special cases might be considered a major problem with parallel programming in general. After all, the C-language ++ operator works just fine in single-threaded code, and not just for special cases, but in general, right?
This line of reasoning does contain a grain of truth, but is in essence misguided. The problem is not parallelism as such, but rather scalability. To understand this, first consider the C-language ++ operator. The fact is that it does not work in general, only for a restricted range of numbers. If you need to deal with 1,000-digit decimal numbers, the C-language ++ operator will not work for you.
Quick Quiz 6.53: The ++ operator works just fine for 1,000-digit numbers! Haven't you heard of operator overloading??? End Quick Quiz
This problem is not specific to arithmetic. Suppose you need to store and query data. Should you use an ASCII file, XML, a relational database, a linked list, a dense array, a B-tree, a radix tree, or any of the plethora of other data structures and environments that permit data to be stored and queried? It depends on what you need to do, how fast you need it done, and how large your data set is.
Similarly, if you need to count, your solution will depend on how large the numbers you must work with are, how many CPUs need to manipulate a given number concurrently, how the number is to be used, and what level of performance and scalability you require.
Nor is this problem specific to software. The design for a bridge meant to allow people to walk across a small brook might be as simple as a plank thrown across that brook. But this plank-based solution does not scale. You would probably not use a plank to span the kilometers-wide mouth of the Columbia River, nor would such a design be advisable for bridges carrying concrete trucks. In short, just as bridge design must change with increasing span and load, so must software design change as the number of CPUs increases.
The examples in this chapter have shown that an important tool permitting large numbers of CPUs to be brought to bear is partitioning. The counters might be fully partitioned, as in the statistical counters discussed in Section , or partially partitioned, as in the limit counters discussed in Sections and .
Partitioning will be considered in far greater depth in the next chapter.
Quick Quiz 6.54: But if we are going to have to partition everything, why bother with shared-memory multithreading? Why not just partition the problem completely and run as multiple processes, each in its own address space? End Quick Quiz
Paul E. McKenney 2011-12-16