One way to provide per-thread variables is to allocate an array with one element per thread (presumably cache aligned and padded to avoid false sharing).
Quick Quiz 6.12: An array??? But doesn't that limit the number of threads? End Quick Quiz
Such an array can be wrapped into per-thread primitives, as shown in
Figure
(count_stat.c).
Line 1 defines an array containing a set of per-thread counters of
type long named, creatively enough, counter.
Lines 3-6 show a function that increments the counters, using the __get_thread_var() primitive to locate the currently running thread's element of the counter array. Because this element is modified only by the corresponding thread, non-atomic increment suffices.
Lines 8-16 show a function that reads out the aggregate value of the counter, using the for_each_thread() primitive to iterate over the list of currently running threads, and using the per_thread() primitive to fetch the specified thread's counter. Because the hardware can fetch and store a properly aligned long atomically, and because gcc is kind enough to make use of this capability, normal loads suffice, and no special atomic instructions are required.
Quick Quiz 6.13: What other choice does gcc have, anyway??? End Quick Quiz
Quick Quiz 6.14:
How does the per-thread counter variable in
Figure
get initialized?
End Quick Quiz
Quick Quiz 6.15:
How is the code in
Figure
supposed to permit more than one counter?
End Quick Quiz
This approach scales linearly with increasing number of updater threads
invoking inc_count().
As is shown by the green arrows in
Figure ,
the reason for this is that each CPU can make rapid progress incrementing
its thread's variable, with no expensive communication required crossing
the full diameter of the computer system.
However, this excellent update-side scalability comes at great read-side
expense for large numbers of threads.
The next section shows one way to reduce read-side expense while
still retaining the update-side scalability.
Paul E. McKenney 2011-12-16