7.3.5 Locking Granularity and Performance

This section looks at locking granularity and performance from a mathematical synchronization-efficiency viewpoint. Readers who are uninspired by mathematics might choose to skip this section.

The approach is to use a crude queueing model for the efficiency of synchronization mechanisms that operate on a single shared global variable, based on an M/M/1 queue. M/M/1 queueing models are based on exponentially distributed inter-arrival times with average ``arrival rate'' $\lambda$ and exponentially distributed service times with average ``service rate'' $\mu$. The arrival rate $\lambda$ can be thought of as the average number of synchronization operations per second that the system would process if the synchronization itself were free; in other words, $\lambda$ is an inverse measure of the overhead of each non-synchronization unit of work. For example, if each unit of work was a transaction, and if each transaction took one millisecond to process (not counting synchronization overhead), then $\lambda$ would be 1,000 transactions per second.

The service rate $\mu$ is defined similarly, but as the average number of synchronization operations per second that the system would process if the overhead of each transaction were zero, and ignoring the fact that CPUs must wait on each other to complete their increment operations; in other words, $\mu$ can be roughly thought of as the inverse of the per-operation synchronization overhead in the absence of contention. For example, some recent computer systems are able to do an atomic increment every 25 nanoseconds or so if all CPUs are doing atomic increments in a tight loop. The value of $\mu$ is therefore about 40,000,000 atomic increments per second.
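By way of illustration, a value for $\mu$ can be estimated with a tight-loop microbenchmark along the following lines. This is a minimal sketch using C11 atomics and POSIX threads rather than code from this book's CodeSamples directory; the thread count and iteration count are arbitrary choices, and a real measurement would pin threads to CPUs and repeat the runs.

\begin{verbatim}
/* Minimal sketch: estimate the aggregate atomic-increment
 * rate (a rough approximation of mu) using C11 atomics and
 * POSIX threads.  Error checking omitted for brevity. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4           /* arbitrary choice */
#define NINCS 10000000L      /* increments per thread */

static atomic_long counter;

static void *incr_loop(void *arg)
{
        for (long i = 0; i < NINCS; i++)
                atomic_fetch_add(&counter, 1);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        struct timespec t0, t1;
        double secs;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, incr_loop, NULL);
        for (long t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f atomic increments per second\n",
               NTHREADS * NINCS / secs);
        return 0;
}
\end{verbatim}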

Of course, the value of $\lambda$ increases with increasing numbers of CPUs, as each CPU is capable of processing transactions independently (again, ignoring synchronization):


\begin{equation}
\lambda = n \lambda_0
\end{equation}

where $n$ is the number of CPUs and $\lambda_0$ is the transaction-processing capability of a single CPU. Note that the expected time for a single CPU to execute a single transaction is $1 / \lambda_0$.

Because the CPUs have to ``wait in line'' behind each other to get their chance to increment the single shared variable, we can use the M/M/1 queueing-model expression for the expected total waiting time:


\begin{equation}
T = \frac{1}{\mu - \lambda}
\end{equation}

Substituting the above value of $\lambda$:


\begin{equation}
T = \frac{1}{\mu - n \lambda_0}
\end{equation}

Now, the efficiency is just the ratio of the time required to process a transaction in the absence of synchronization to the time required including synchronization:


\begin{equation}
e = \frac{1 / \lambda_0}{T + 1 / \lambda_0}
\end{equation}

Substituting the above value for $T$ and simplifying:


\begin{equation}
e = \frac{\frac{\mu}{\lambda_0} - n}{\frac{\mu}{\lambda_0} - (n - 1)}
\end{equation}
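For readers wishing to check the algebra: substituting the expression for $T$ and multiplying numerator and denominator by $\mu - n \lambda_0$ gives

\begin{displaymath}
e = \frac{1 / \lambda_0}{\frac{1}{\mu - n \lambda_0} + \frac{1}{\lambda_0}}
  = \frac{\mu - n \lambda_0}{\mu - (n - 1) \lambda_0}
\end{displaymath}

and dividing numerator and denominator by $\lambda_0$ yields the form above.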

But the value of $\mu / \lambda_0$ is just the ratio of the time required to process the transaction (absent synchronization overhead) to that of the synchronization overhead itself (absent contention). If we call this ratio $f$, we have:


\begin{equation}
e = \frac{f - n}{f - (n - 1)}
\end{equation}
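Plugging a few numbers into this expression shows how sharp the drop-off is. Taking $f = 100$:

\begin{displaymath}
e \vert_{n=10} = \frac{90}{91} \approx 0.99 \qquad
e \vert_{n=90} = \frac{10}{11} \approx 0.91 \qquad
e \vert_{n=99} = \frac{1}{2}
\end{displaymath}

The first ten CPUs cost almost nothing, but efficiency collapses as $n$ approaches $f$, reaching zero at $n = f$, where the queue saturates.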

\begin{figure}
\resizebox{3in}{!}{\includegraphics{SMPdesign/synceff}}
\caption{Synchronization Efficiency}
\end{figure}

Figure [*] plots the synchronization efficiency $e$ as a function of the number of CPUs/threads $n$ for a few values of the overhead ratio $f$. For example, again using the 25-nanosecond atomic increment, the $f=10$ line corresponds to each CPU attempting an atomic increment every 250 nanoseconds, and the $f=100$ line corresponds to each CPU attempting an atomic increment every 2.5 microseconds, which in turn corresponds to several thousand instructions. Given that each trace drops off sharply with increasing numbers of CPUs or threads, we can conclude that synchronization mechanisms based on atomic manipulation of a single global shared variable will not scale well if used heavily on current commodity hardware. This is a mathematical depiction of the forces leading to the parallel counting algorithms that were discussed in Chapter [*].

The concept of efficiency is useful even in cases having little or no formal synchronization. Consider for example a matrix multiply, in which each element of the result matrix is the ``dot product'' of a row of one input matrix and a column of the other. Because none of these operations conflict, it is possible to partition the columns of the result matrix among a group of threads, with each thread computing its assigned columns. The threads can therefore operate entirely independently, with no synchronization overhead whatsoever, as is done in matmul.c. One might therefore expect a parallel matrix multiply to have a perfect efficiency of 1.0.
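The following is a minimal sketch of this zero-synchronization partitioning using POSIX threads. It is not the book's matmul.c: the matrix dimension and thread count are arbitrary (and the sketch assumes the former is a multiple of the latter), and initialization and timing are omitted.

\begin{verbatim}
/* Minimal sketch: synchronization-free parallel matrix
 * multiply C = A * B, partitioning the columns of the
 * result among the threads.  Each thread writes only its
 * own columns of C, so no locks or atomics are needed. */
#include <pthread.h>
#include <stdio.h>

#define N 512        /* arbitrary matrix dimension */
#define NTHREADS 4   /* arbitrary; assumes N % NTHREADS == 0 */

static double A[N][N], B[N][N], C[N][N];

static void *worker(void *arg)
{
        long t = (long)arg;
        int firstcol = t * (N / NTHREADS);
        int lastcol = firstcol + N / NTHREADS;

        for (int j = firstcol; j < lastcol; j++)
                for (int i = 0; i < N; i++) {
                        double dot = 0.0;

                        for (int k = 0; k < N; k++)
                                dot += A[i][k] * B[k][j];
                        C[i][j] = dot;
                }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        /* ... initialize A and B here ... */
        for (long t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        printf("C[0][0] = %g\n", C[0][0]);
        return 0;
}
\end{verbatim}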

\begin{figure}
\resizebox{3in}{!}{\includegraphics{SMPdesign/matmuleff}}
\caption{Matrix Multiply Efficiency}
\end{figure}

However, Figure [*] tells a different story, especially for a 64-by-64 matrix multiply, which never gets above an efficiency of about 0.7, even when running single-threaded. The 512-by-512 matrix multiply's efficiency is measurably less than 1.0 on as few as 10 threads, and even the 1024-by-1024 matrix multiply deviates noticeably from perfection at a few tens of threads.

Quick Quiz 7.12: How can a single-threaded 64-by-64 matrix multiply possibly have an efficiency of less than 1.0? Shouldn't all of the traces in Figure [*] have an efficiency of exactly 1.0 when running on only one thread? End Quick Quiz

Given these inefficiencies, it is worthwhile to look into more-scalable approaches such as the data locking described in Section [*] or the parallel-fastpath approach discussed in the next section.

Quick Quiz 7.13: How are data-parallel techniques going to help with matrix multiply? It is already data parallel!!! End Quick Quiz
