6.2.2 Array-Based Implementation

One way to provide per-thread variables is to allocate an array with one element per thread (presumably cache aligned and padded to avoid false sharing).

Quick Quiz 6.12: An array??? But doesn't that limit the number of threads? End Quick Quiz
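For concreteness, such an array might be declared along the following lines. This is only a sketch, not the code used in this section: the CACHE_LINE_SIZE and MAX_THREADS values and the counter_slot and counter_array names are assumptions made for illustration.

\begin{verbatim}/* Illustrative sketch only; names and values are assumptions. */
#define CACHE_LINE_SIZE 64   /* assumed cache-line size in bytes */
#define MAX_THREADS     128  /* assumed upper bound on threads */

/* One counter per thread.  Aligning each element to a cache line
 * keeps different threads' counters out of the same line, avoiding
 * false sharing on updates. */
struct counter_slot {
  long value;
} __attribute__((__aligned__(CACHE_LINE_SIZE)));

static struct counter_slot counter_array[MAX_THREADS];\end{verbatim}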

Figure: Array-Based Per-Thread Statistical Counters
\begin{figure}{ \scriptsize
\begin{verbatim}  1 DEFINE_PER_THREAD(long, counter);
  2
  3 void inc_count(void)
  4 {
  5   __get_thread_var(counter)++;
  6 }
  7
  8 long read_count(void)
  9 {
 10   int t;
 11   long sum = 0;
 12
 13   for_each_thread(t)
 14     sum += per_thread(counter, t);
 15   return sum;
 16 }\end{verbatim}
}\end{figure}

Such an array can be wrapped into per-thread primitives, as shown in Figure [*] (count_stat.c). Line 1 defines an array containing a set of per-thread counters of type long named, creatively enough, counter.

Lines 3-6 show a function that increments the counters, using the __get_thread_var() primitive to locate the currently running thread's element of the counter array. Because this element is modified only by the corresponding thread, non-atomic increment suffices.

Lines 8-16 show a function that reads out the aggregate value of the counter, using the for_each_thread() primitive to iterate over the list of currently running threads, and using the per_thread() primitive to fetch the specified thread's counter. Because the hardware can fetch and store a properly aligned long atomically, and because gcc is kind enough to make use of this capability, normal loads suffice, and no special atomic instructions are required.

Quick Quiz 6.13: What other choice does gcc have, anyway??? End Quick Quiz

Quick Quiz 6.14: How does the per-thread counter variable in Figure [*] get initialized? End Quick Quiz

Quick Quiz 6.15: How is the code in Figure [*] supposed to permit more than one counter? End Quick Quiz
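The DEFINE_PER_THREAD(), __get_thread_var(), per_thread(), and for_each_thread() primitives used above come from this book's supporting API rather than from standard C. The sketch below shows one way such primitives might be built; it is only loosely modeled on that API, and NR_THREADS, CACHE_LINE_SIZE, and smp_thread_id() (assumed to return the calling thread's small integer index) are illustrative assumptions, not definitions given in this section.

\begin{verbatim}/* Illustrative sketch only; the book's api.h differs in detail. */
#define NR_THREADS      128  /* assumed maximum number of threads */
#define CACHE_LINE_SIZE 64   /* assumed cache-line size in bytes */

/* Declare a cache-line-aligned array with one element per thread. */
#define DEFINE_PER_THREAD(type, name) \
  struct { \
    type v __attribute__((__aligned__(CACHE_LINE_SIZE))); \
  } __per_thread_##name[NR_THREADS]

/* Access a given thread's element, or the calling thread's element. */
#define per_thread(name, t)    (__per_thread_##name[t].v)
#define __get_thread_var(name) per_thread(name, smp_thread_id())

/* Iterate over all possible thread indexes. */
#define for_each_thread(t) \
  for ((t) = 0; (t) < NR_THREADS; (t)++)\end{verbatim}

With definitions along these lines, the increment on line 5 of the figure expands to an ordinary non-atomic increment of the calling thread's own cache-line-aligned element, while read_count() simply walks the array.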

Figure: Data Flow For Per-Thread Increment (count/PerThreadInc)

This approach scales linearly with the number of updater threads invoking inc_count(). As is shown by the green arrows in Figure [*], the reason is that each CPU can make rapid progress incrementing its own thread's variable, without any expensive communication crossing the full diameter of the computer system. However, this excellent update-side scalability comes at great read-side expense for large numbers of threads. The next section shows one way to reduce read-side expense while still retaining the update-side scalability.
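As a rough, self-contained illustration of how such a counter might be exercised, the sketch below runs one updater per thread and then sums the per-thread elements. It uses plain POSIX threads rather than this book's thread primitives, and NTHREADS and NUPDATES are arbitrary values chosen for the example.

\begin{verbatim}#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4
#define NUPDATES 1000000L

/* One element per thread; a production version would also align and
 * pad each element to a cache line, as discussed earlier. */
static long counter[NTHREADS];

static void *updater(void *arg)
{
  uintptr_t me = (uintptr_t)arg;
  long i;

  for (i = 0; i < NUPDATES; i++)
    counter[me]++;  /* only thread "me" writes this element */
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];
  uintptr_t t;
  long sum = 0;

  for (t = 0; t < NTHREADS; t++)
    pthread_create(&tid[t], NULL, updater, (void *)t);
  for (t = 0; t < NTHREADS; t++)
    pthread_join(tid[t], NULL);
  for (t = 0; t < NTHREADS; t++)
    sum += counter[t];  /* analogous to read_count() */
  printf("sum = %ld (expected %ld)\n", sum, NTHREADS * NUPDATES);
  return 0;
}\end{verbatim}

Each updater runs free of contention on its own array element, while the final summation is the read-side pass whose cost grows with the number of threads.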
