4.2.2 Costs of Operations


Table: Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System
Operation            Cost (ns)          Ratio
Clock period               0.6            1.0
Best-case CAS             37.9           63.2
Best-case lock            65.6          109.3
Single cache miss        139.5          232.5
CAS cache miss           306.0          510.0
Comms Fabric             3,000          5,000
Global Comms       130,000,000    216,000,000


The overheads of some common operations important to parallel programs are displayed in the table above. This system's clock period rounds to 0.6ns. Although it is not unusual for modern microprocessors to retire multiple instructions per clock period, the costs are normalized to a full clock period in the third column, labeled "Ratio". The first thing to note about this table is the large values of many of the ratios.
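The normalization behind the "Ratio" column is a single division, sketched here for a few rows of the table. The names `COSTS_NS` and `ratio` are illustrative only and do not come from the original text; the costs are copied from the table.

```python
# Costs in nanoseconds, copied from the table above.
CLOCK_PERIOD_NS = 0.6

COSTS_NS = {
    "Best-case CAS": 37.9,
    "Best-case lock": 65.6,
    "Single cache miss": 139.5,
    "CAS cache miss": 306.0,
    "Comms Fabric": 3_000.0,
}


def ratio(cost_ns, clock_period_ns=CLOCK_PERIOD_NS):
    """Return a cost expressed in clock periods (the 'Ratio' column)."""
    return cost_ns / clock_period_ns


if __name__ == "__main__":
    for name, cost in COSTS_NS.items():
        print(f"{name:18s} {cost:10.1f} ns {ratio(cost):10.1f} clock periods")
```

Because the clock period is itself rounded to 0.6ns, the ratios are only as precise as that rounding allows.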

The best-case CAS operation consumes almost forty nanoseconds, a duration more than sixty times that of the clock period. Here, "best case" means that the same CPU now performing the CAS operation on a given variable was the last CPU to operate on this variable, so that the corresponding cache line is already held in that CPU's cache. Similarly, the best-case lock operation (a "round trip" pair consisting of a lock acquisition followed by a lock release) consumes more than sixty nanoseconds, or more than one hundred clock cycles. Again, "best case" means that the data structure representing the lock is already in the cache belonging to the CPU acquiring and releasing the lock. The lock operation is more expensive than CAS because it requires two atomic operations on the lock data structure.

An operation that misses the cache consumes almost one hundred and forty nanoseconds, or more than two hundred clock cycles. A CAS operation that misses the cache, which must look at the old value of the variable as well as store a new value, consumes over three hundred nanoseconds, or more than five hundred clock cycles. Think about this a bit. In the time required to do one such CAS operation, the CPU could have executed more than five hundred normal instructions. This should demonstrate the limitations of fine-grained locking.
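To put a rough number on that limitation, the sketch below estimates what fraction of time goes to useful work when each unit of work pays for one CAS cache miss. It is a deliberately simple model and not a measurement: it assumes one instruction retired per clock period and exactly one CAS cache miss per unit of work, and the function name `useful_fraction` is hypothetical.

```python
CAS_MISS_NS = 306.0     # "CAS cache miss" row of the table
CLOCK_PERIOD_NS = 0.6   # "Clock period" row of the table


def useful_fraction(work_instructions,
                    sync_ns=CAS_MISS_NS,
                    clock_ns=CLOCK_PERIOD_NS):
    """Fraction of elapsed time spent on useful work.

    Assumes (for illustration only) one instruction per clock period
    and one synchronization operation per unit of work.
    """
    work_ns = work_instructions * clock_ns
    return work_ns / (work_ns + sync_ns)


if __name__ == "__main__":
    for n in (10, 100, 510, 5_000, 50_000):
        print(f"{n:6d} instructions per CAS miss: "
              f"{useful_fraction(n):6.1%} useful work")
```

Under these assumptions, the work guarded by each CAS must amount to roughly five hundred instructions just to reach the break-even point where half of the time is spent on useful work.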

Quick Quiz 4.5: Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations? End Quick Quiz

I/O operations are even more expensive. A high performance (and expensive!) communications fabric, such as InfiniBand or any number of proprietary interconnects, has a latency of roughly three microseconds, during which time five thousand instructions might have been executed. Standards-based communications networks often require some sort of protocol processing, which further increases the latency. Of course, geographic distance also increases latency, with the theoretical speed-of-light latency around the world coming to roughly 130 milliseconds, or more than 200 million clock cycles.
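The speed-of-light figure can be checked with a short calculation. This sketch assumes an equatorial circumference of about 40,075 km and light traveling in a vacuum along that path; both are simplifying assumptions added here, and real networks are slower. The function names are illustrative.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
EARTH_CIRCUMFERENCE_KM = 40_075.0      # approximate equatorial value (assumption)
CLOCK_PERIOD_NS = 0.6                  # "Clock period" row of the table


def around_the_world_latency_ns():
    """Time for light to circle the Earth once, in nanoseconds."""
    seconds = EARTH_CIRCUMFERENCE_KM / SPEED_OF_LIGHT_KM_PER_S
    return seconds * 1e9


def clock_periods(latency_ns, clock_ns=CLOCK_PERIOD_NS):
    """Express a latency in clock periods."""
    return latency_ns / clock_ns


if __name__ == "__main__":
    latency = around_the_world_latency_ns()
    print(f"latency: {latency / 1e6:.1f} ms")
    print(f"clock periods: {clock_periods(latency):,.0f}")
```

The result lands slightly above 130 milliseconds, consistent with the "roughly 130 milliseconds" and "more than 200 million clock cycles" quoted in the text.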

Quick Quiz 4.6: These numbers are insanely large! How can I possibly get my head around them? End Quick Quiz

Paul E. McKenney 2011-12-16