6.2.4 Per-Thread-Variable-Based Implementation

Figure: Per-Thread Statistical Counters
\begin{figure}{ \scriptsize
\begin{verbatim}1 long __thread counter = 0;
2 l...
...[idx] = NULL;
41 spin_unlock(&final_mutex);
42 }\end{verbatim}
}\end{figure}

Fortunately, gcc provides an __thread storage class that provides per-thread storage. This can be used as shown in Figure [*] (count_end.c) to implement a statistical counter that not only scales, but that also incurs little or no performance penalty to incrementers compared to simple non-atomic increment.

Lines 1-4 define needed variables: counter is the per-thread counter variable, the counterp[] array allows threads to access each others' counters, finalcount accumulates the total as individual threads exit, and final_mutex coordinates between threads accumulating the total value of the counter and exiting threads.

Quick Quiz 6.19: Why do we need an explicit array to find the other threads' counters? Why doesn't gcc provide a per_thread() interface, similar to the Linux kernel's per_cpu() primitive, to allow threads to more easily access each others' per-thread variables? End Quick Quiz

The inc_count() function used by updaters is quite simple, as can be seen on lines 6-9.

The read_count() function used by readers is a bit more complex. Line 16 acquires a lock to exclude exiting threads, and line 21 releases it. Line 17 initializes the sum to the count accumulated by those threads that have already exited, and lines 18-20 sum the counts being accumulated by threads currently running. Finally, line 22 returns the sum.

Quick Quiz 6.20: Why on earth do we need something as heavyweight as a lock guarding the summation in the function read_count() in Figure [*]? End Quick Quiz

Lines 25-32 show the count_register_thread() function, which must be called by each thread before its first use of this counter. This function simply sets up this thread's element of the counterp[] array to point to its per-thread counter variable.

Quick Quiz 6.21: Why on earth do we need to acquire the lock in count_register_thread() in Figure [*]? It is a single properly aligned machine-word store to a location that no other thread is modifying, so it should be atomic anyway, right? End Quick Quiz

Lines 34-42 show the count_unregister_thread() function, which must be called prior to exit by each thread that previously called count_register_thread(). Line 38 acquires the lock, and line 41 releases it, thus excluding any calls to read_count() as well as other calls to count_unregister_thread(). Line 39 adds this thread's counter to the global finalcount, and then NULLs out its counterp[] array entry. A subsequent call to read_count() will see the exiting thread's count in the global finalcount, and will skip the exiting thread when sequencing through the counterp[] array, thus obtaining the correct total.

This approach gives updaters almost exactly the same performance as a non-atomic add, and also scales linearly. On the other hand, concurrent reads contend for a single global lock, and therefore perform poorly and scale abysmally. However, this is not a problem for statistical counters, where incrementing happens often and readout happens almost never. In addition, this approach is considerably more complex than the array-based scheme, due to the fact that a given thread's per-thread variables vanish when that thread exits.

Quick Quiz 6.22: Fine, but the Linux kernel doesn't have to acquire a lock when reading out the aggregate value of per-CPU counters. So why should user-space code need to do this??? End Quick Quiz

Paul E. McKenney 2011-12-16