F.8 Chapter [*]

Quick Quiz [*].1: 
Why on earth did we need that global lock in the first place?
 
Answer:
A given thread's __thread variables vanish when that thread exits. It is therefore necessary to synchronize any operation that accesses other threads' __thread variables with thread exit. Without such synchronization, accesses to the __thread variables of a just-exited thread will result in segmentation faults.
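To make this concrete, here is a minimal sketch of the lock-based approach, not the book's figure: the NR_THREADS bound and variable layout are invented for illustration, though the final_mutex name follows the text. The lock serializes read_count()'s traversal of other threads' counters against thread exit:

\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 128          /* illustrative bound, not from the text */

static __thread unsigned long counter;  /* vanishes when its thread exits */
static unsigned long *counterp[NR_THREADS]; /* live threads' counters */
static unsigned long finalcount;        /* tally preserved from exited threads */
static pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER;

static void inc_count(void)
{
        counter++;              /* each thread updates only its own counter */
}

static unsigned long read_count(void)
{
        unsigned long sum;
        int t;

        pthread_mutex_lock(&final_mutex);       /* exclude thread exit... */
        sum = finalcount;
        for (t = 0; t < NR_THREADS; t++)
                if (counterp[t] != NULL)
                        sum += *counterp[t];    /* ...so this cannot fault */
        pthread_mutex_unlock(&final_mutex);
        return sum;
}

static void count_unregister_thread(int id)
{
        pthread_mutex_lock(&final_mutex);       /* wait for in-flight readers */
        finalcount += counter;                  /* preserve this thread's count */
        counterp[id] = NULL;    /* retract the pointer before the variable dies */
        pthread_mutex_unlock(&final_mutex);
}
\end{verbatim}

Without final_mutex, count_unregister_thread() could run between read_count()'s NULL check and its dereference of counterp[t], so that read_count() would dereference a pointer to a __thread variable that no longer exists.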

Quick Quiz [*].2: 
Just what is the accuracy of read_count(), anyway?
 
Answer:
Refer to Figure [*] on Page [*]. Clearly, if there are no concurrent invocations of inc_count(), read_count() will return an exact result. However, if there are concurrent invocations of inc_count(), then the sum is in fact changing as read_count() performs its summation. That said, because thread creation and exit are excluded by final_mutex, the pointers in counterp remain constant.

Let's imagine a mythical machine that is able to take an instantaneous snapshot of its memory. Suppose that this machine takes such a snapshot at the beginning of read_count()'s execution, and another snapshot at the end of read_count()'s execution. Then read_count() will access each thread's counter at some time between these two snapshots, and will therefore obtain a result that is bounded by those of the two snapshots, inclusive. The overall sum will therefore be bounded by the pair of sums that would have been obtained from each of the two snapshots (again, inclusive).

The expected error is therefore half of the difference between the pair of sums that would have been obtained from the two snapshots, in other words, half of read_count()'s execution time multiplied by the expected rate of inc_count() calls per unit time.

Or, for those who prefer equations:

\begin{equation}
  \epsilon = \frac{T_r R_i}{2}
\end{equation}

where $\epsilon$ is the expected error in read_count()'s return value, $T_r$ is the time that read_count() takes to execute, and $R_i$ is the rate of inc_count() calls per unit time. (And of course, $T_r$ and $R_i$ should use the same units of time: microseconds and calls per microsecond, seconds and calls per second, or whatever, as long as they are the same units.)
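For example, plugging in purely hypothetical values of $T_r$ = 10 microseconds and $R_i$ = 1 call per microsecond gives:

\begin{displaymath}
\epsilon = \frac{10 \times 1}{2} = 5 \mbox{~counts}
\end{displaymath}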

Quick Quiz [*].3: 
Hey!!! Line 45 of Figure [*] modifies a value in a pre-existing countarray structure! Didn't you say that this structure, once made available to read_count(), remained constant???
 
Answer:
Indeed I did say that. And it would be possible to make count_register_thread() allocate a new structure, much as count_unregister_thread() currently does.

But this is unnecessary. Recall that the derivation of read_count()'s error bounds was based on snapshots of memory. Because new threads start with initial counter values of zero, the derivation holds even if a new thread is added partway through read_count()'s execution. So, interestingly enough, when adding a new thread, this implementation gets the effect of allocating a new structure, but without actually having to do the allocation.
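For concreteness, here is a minimal sketch of such an in-place update, using the countarray and final_mutex names from the text; the field names, NR_THREADS bound, and thread-id parameter are illustrative rather than the book's exact figure:

\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 128          /* illustrative bound, not from the text */

struct countarray {
        unsigned long total;                    /* counts from exited threads */
        unsigned long *counterp[NR_THREADS];    /* live threads' counters */
};

static struct countarray *countarrayp;  /* RCU-protected in the text's figure */
static __thread unsigned long counter;
static pthread_mutex_t final_mutex = PTHREAD_MUTEX_INITIALIZER;

void count_register_thread(int id)
{
        pthread_mutex_lock(&final_mutex);
        /* Update the pre-existing countarray in place rather than
         * allocating a new one: the new counter starts at zero, so
         * read_count()'s snapshot-based error bound still holds. */
        countarrayp->counterp[id] = &counter;
        pthread_mutex_unlock(&final_mutex);
}
\end{verbatim}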

Quick Quiz [*].4: 
Wow! Figure [*] contains 69 lines of code, compared to only 42 in Figure [*]. Is this extra complexity really worth it?
 
Answer:
This of course needs to be decided on a case-by-case basis. If you need an implementation of read_count() that scales linearly, then the lock-based implementation shown in Figure [*] simply will not work for you. On the other hand, if calls to read_count() are sufficiently rare, then the lock-based version is simpler and might thus be better, although much of the size difference is due to the structure definition, memory allocation, and NULL-return checking.

Of course, a better question is "why doesn't the language implement cross-thread access to __thread variables?" After all, such an implementation would make both the locking and the use of RCU unnecessary. This would in turn enable an implementation that was even simpler than the one shown in Figure [*], but with all the scalability and performance benefits of the implementation shown in Figure [*]!
