This section looks at locking granularity and performance from a mathematical synchronization-efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip this section.
The approach is to use a crude queueing model for the efficiency of
synchronization mechanisms that operate on a single shared global
variable, based on an M/M/1 queue.
M/M/1 queueing models are based on an exponentially distributed
``inter-arrival rate'' $\lambda$ and an exponentially distributed
``service rate'' $\mu$.
The inter-arrival rate $\lambda$ can be thought of as the average
number of synchronization operations per second that the system
would process if the synchronization were free, in other words,
$\lambda$ is an inverse measure of the overhead of each non-synchronization
unit of work.
For example, if each unit of work was a transaction, and if each transaction
took one millisecond to process, not counting synchronization overhead,
then $\lambda$ would be 1,000 transactions per second.
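Spelling out that arithmetic:

\[
\lambda = \frac{1}{1\,\mathrm{ms}} = \frac{1}{0.001\,\mathrm{s}} = 1{,}000\ \mathrm{transactions\ per\ second}
\]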
The service rate $\mu$ is defined similarly, but for the average
number of synchronization operations per second that the system
would process if the overhead of each transaction was zero, and
ignoring the fact that CPUs must wait on each other to complete
their increment operations, in other words, $\mu$ can be roughly
thought of as the synchronization overhead in absence of contention.
For example, some recent computer systems are able to do
an atomic increment every 25 nanoseconds or so if all CPUs are
doing atomic increments in a tight loop.
The value of $\mu$ is therefore about 40,000,000 atomic increments
per second.
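A rough way to estimate this quantity on a given machine is to time a
tight atomic-increment loop.
The following minimal sketch (file name, iteration count, and output
format are mine) uses C11 atomics; note that a single thread measures
the uncontended increment rate, so reproducing the all-CPUs
25-nanosecond figure quoted above would require a multi-threaded
harness:

\begin{verbatim}
/* Sketch: estimate single-CPU atomic-increment rate.
 * Build with: gcc -O2 -o atomic_rate atomic_rate.c */
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        atomic_ulong counter = 0;
        unsigned long i, n = 100000000UL;  /* arbitrary iteration count */
        struct timespec start, end;
        double secs;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < n; i++)
                atomic_fetch_add(&counter, 1);
        clock_gettime(CLOCK_MONOTONIC, &end);
        secs = (end.tv_sec - start.tv_sec) +
               (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%.0f atomic increments per second (one CPU)\n", n / secs);
        return 0;
}
\end{verbatim}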
Of course, the value of $\lambda$ increases with increasing numbers of
CPUs, as each CPU is capable of processing transactions independently
(again, ignoring synchronization):

\[
\lambda = n \lambda_0
\tag{7.1}
\]

where $n$ is the number of CPUs and $\lambda_0$ is the transaction-processing
capability of a single CPU.
Note that the expected time for a single CPU to execute a single transaction
is $1 / \lambda_0$.
Because the CPUs have to ``wait in line'' behind each other to get their
chance to increment the single shared variable, we can use the M/M/1
queueing-model expression for the expected total waiting time:

\[
T = \frac{1}{\mu - \lambda}
\tag{7.2}
\]
Substituting the above value of $\lambda$:

\[
T = \frac{1}{\mu - n \lambda_0}
\tag{7.3}
\]
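Plugging in illustrative numbers (mine, not from the text): with
$\mu = 4 \times 10^7$ operations per second as above,
$\lambda_0 = 10^6$ transactions per second (one microsecond per
transaction), and $n = 10$ CPUs:

\[
T = \frac{1}{4 \times 10^7 - 10 \times 10^6} \approx 33\ \mathrm{nanoseconds}
\]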
Now, the efficiency is just the ratio of the time required to process a
transaction in absence of synchronization ($1/\lambda_0$) to the time
required including synchronization ($T + 1/\lambda_0$):

\[
e = \frac{1/\lambda_0}{T + 1/\lambda_0}
\tag{7.4}
\]
Substituting the above value for $T$ and simplifying:

\[
e = \frac{\frac{\mu}{\lambda_0} - n}{\frac{\mu}{\lambda_0} - (n - 1)}
\tag{7.5}
\]
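For readers who wish to check the simplification, the intermediate
algebra (my expansion, using nothing beyond Equations 7.3 and 7.4) is:

\[
e = \frac{1/\lambda_0}{\dfrac{1}{\mu - n\lambda_0} + \dfrac{1}{\lambda_0}}
  = \frac{\mu - n\lambda_0}{\lambda_0 + (\mu - n\lambda_0)}
  = \frac{\mu - n\lambda_0}{\mu - (n - 1)\lambda_0}
\]

after which dividing numerator and denominator by $\lambda_0$ yields
Equation 7.5.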
But the value of $\mu / \lambda_0$ is just the ratio of the time required
to process the transaction (absent synchronization overhead) to that of
the synchronization overhead itself (absent contention).
If we call this ratio $f$, we have:

\[
e = \frac{f - n}{f - (n - 1)}
\tag{7.6}
\]
The accompanying figure plots the synchronization
efficiency $e$ as a function of the number of CPUs/threads $n$ for
a few values of the overhead ratio $f$.
For example, again using the 25-nanosecond atomic increment, the
$f = 10$ line corresponds to each CPU attempting an atomic increment
every 250 nanoseconds, and the $f = 100$ line corresponds to each
CPU attempting an atomic increment every 2.5 microseconds,
which in turn corresponds to several thousand instructions.
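Equation 7.6 is simple enough to tabulate directly.
The following sketch (entirely mine) prints $e$ for a few values of
$f$, showing the sharp drop-off as $n$ grows:

\begin{verbatim}
/* Sketch: tabulate e = (f - n) / (f - (n - 1)) from Equation 7.6. */
#include <stdio.h>

int main(void)
{
        double f[] = { 10.0, 25.0, 100.0 };
        int i, n;

        printf("%4s", "n");
        for (i = 0; i < 3; i++)
                printf("   f=%-5.0f", f[i]);
        printf("\n");
        for (n = 1; n <= 8; n++) {
                printf("%4d", n);
                for (i = 0; i < 3; i++) {
                        double e = (f[i] - n) / (f[i] - (n - 1));

                        /* The model saturates as n approaches f,
                         * so clamp negative values to zero. */
                        printf("  %8.3f", e > 0.0 ? e : 0.0);
                }
                printf("\n");
        }
        return 0;
}
\end{verbatim}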
Given that each trace drops off sharply with increasing numbers of
CPUs or threads, we can conclude that
synchronization mechanisms based on
atomic manipulation of a single global shared variable will not
scale well if used heavily on current commodity hardware.
This is a mathematical depiction of the forces leading to the parallel
counting algorithms that were discussed in the earlier chapter on
counting.
The concept of efficiency is useful even in cases having little or no formal synchronization. Consider for example a matrix multiply, in which the columns of one matrix are multiplied (via ``dot product'') by the rows of another, resulting in an entry in a third matrix. Because none of these operations conflict, it is possible to partition the columns of the first matrix among a group of threads, with each thread computing the corresponding columns of the result matrix. The threads can therefore operate entirely independently, with no synchronization overhead whatsoever, as is done in matmul.c. One might therefore expect a parallel matrix multiply to have a perfect efficiency of 1.0.
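A minimal pthreads sketch of this zero-synchronization partitioning
follows; it is my illustration of the idea rather than the book's
matmul.c, and the matrix size, thread count, and names are arbitrary.
Each thread writes a disjoint range of columns of the result, so no
locking is required:

\begin{verbatim}
/* Sketch: column-partitioned parallel matrix multiply.
 * Build with: gcc -O2 -pthread -o pmatmul pmatmul.c */
#include <pthread.h>
#include <stdio.h>

#define N 64            /* matrix dimension (arbitrary) */
#define NTHREADS 4      /* number of worker threads (arbitrary) */

static double a[N][N], b[N][N], c[N][N];

struct span {
        int col_begin;  /* first result column for this thread */
        int col_end;    /* one past the last */
};

static void *worker(void *arg)
{
        struct span *s = arg;
        int i, j, k;

        /* This thread writes only columns [col_begin, col_end)
         * of c and only reads a and b, so threads never conflict. */
        for (j = s->col_begin; j < s->col_end; j++)
                for (i = 0; i < N; i++) {
                        double sum = 0.0;

                        for (k = 0; k < N; k++)
                                sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        struct span s[NTHREADS];
        int t, i, j;

        for (i = 0; i < N; i++)
                for (j = 0; j < N; j++) {
                        a[i][j] = 1.0;
                        b[i][j] = (i == j);  /* identity matrix */
                }
        for (t = 0; t < NTHREADS; t++) {
                s[t].col_begin = t * N / NTHREADS;
                s[t].col_end = (t + 1) * N / NTHREADS;
                pthread_create(&tid[t], NULL, worker, &s[t]);
        }
        for (t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        printf("c[0][0] = %g (expect 1)\n", c[0][0]);
        return 0;
}
\end{verbatim}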
However, the accompanying figure of measured matrix multiply efficiency
tells a different story, especially for a 64-by-64 matrix multiply,
which never gets above an efficiency of about 0.7, even when running
single-threaded.
The 512-by-512 matrix multiply's efficiency is measurably less
than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix
multiply deviates noticeably from perfection at a few tens of threads.
Quick Quiz 7.12:
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in that figure
have an efficiency of exactly 1.0 when running on only one thread?
End Quick Quiz
Given these inefficiencies,
it is worthwhile to look into more-scalable approaches
such as the data locking described earlier in this chapter
or the parallel-fastpath approach discussed in the next section.
Quick Quiz 7.13: How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!! End Quick Quiz