One possible design for limit counters is to divide the limit of 10,000 by the number of threads, and give each thread a fixed pool of structures. For example, given 100 threads, each thread would manage its own pool of 100 structures. This approach is simple, and in some cases works well, but it does not handle the common case where a given structure is allocated by one thread and freed by another [MS93]. On the one hand, if a given thread takes credit for any structures it frees, then the thread doing most of the allocating runs out of structures, while the threads doing most of the freeing have lots of credits that they cannot use. On the other hand, if freed structures are credited to the CPU that allocated them, it will be necessary for CPUs to manipulate each others' counters, which will require lots of expensive atomic instructions. Furthermore, because structures come in different sizes, rather than supporting inc_count() and dec_count() interfaces, we implement add_count() and sub_count() to allow variable-sized structures to be properly accounted for.
In short, for many important workloads, we cannot fully partition the counter.
However, we can partially partition the counter,
so that in the common case, each thread need only manipulate its own
private state,
while still allowing counts to flow between threads as needed.
The statistical counting scheme discussed in
Section
provides an interesting starting point,
in that it maintains a global counter as well as per-thread counters,
with the aggregate value being the sum of all of these counters,
global along with per-thread.
The key change is to pull each thread's counter into
the global sum while that thread is still running, rather than waiting
for thread exit.
Clearly, we want threads to pull in their own counts, as cross-thread
accesses are expensive and scale poorly.
This leaves open the question of exactly when a given thread's counter should be pulled into the global counter. In the initial implementation, we will start by maintaining a limit on the value of the per-thread counter. When this limit would be exceeded, the thread pulls its counter into the global counter. Of course, we cannot simply add to the counter when a structure is allocated: we must also subtract from the counter when a structure is freed. We must therefore make use of the global counter when a subtraction would otherwise reduce the value of the per-thread counter below zero. However, if the limit is reasonably large, almost all of the addition and subtraction operations should be handled by the per-thread counter, which should give us good performance and scalability.
This design is an example of ``parallel fastpath'', which is an important design pattern in which the common case executes with no expensive instructions and no interactions between threads, but where occasional use is also made of a more conservatively designed global algorithm.
Paul E. McKenney 2011-12-16