F.2 Chapter [*]

Quick Quiz [*].1: 
Why should parallel programmers bother learning low-level properties of the hardware? Wouldn't it be easier, better, and more general to remain at a higher level of abstraction?
 
Answer:
It might well be easier to ignore the detailed properties of the hardware, but in most cases it would be quite foolish to do so. If you accept that the only purpose of parallelism is to increase performance, and if you further accept that performance depends on detailed properties of the hardware, then it logically follows that parallel programmers are going to need to know at least a few hardware properties.

This is the case in most engineering disciplines. Would you want to use a bridge designed by an engineer who did not understand the properties of the concrete and steel making up that bridge? If not, why would you expect a parallel programmer to be able to develop competent parallel software without at least some understanding of the underlying hardware?

Quick Quiz [*].2: 
What types of machines would allow atomic operations on multiple data elements?
 
Answer:
One answer to this question is that it is often possible to pack multiple elements of data into a single machine word, which can then be manipulated atomically.

A more trendy answer would be machines supporting transactional memory [Lom77]. However, such machines are still (as of 2008) research curiosities. The jury is still out on the applicability of transactional memory [MMW07,PW07,RHP+07].

Quick Quiz [*].3: 
This is a simplified sequence of events? How could it possibly be any more complex?
 
Answer:
This sequence ignored a number of possible complications, including:

  1. Other CPUs might be concurrently attempting to perform CAS operations involving this same cacheline.
  2. The cacheline might have been replicated read-only in several CPUs' caches, in which case, it would need to be flushed from their caches.
  3. CPU 7 might have been operating on the cache line when the request for it arrived, in which case CPU 7 would need to hold off the request until its own operation completed.
  4. CPU 7 might have ejected the cacheline from its cache (for example, in order to make room for other data), so that by the time that the request arrived, the cacheline was on its way to memory.
  5. A correctable error might have occurred in the cacheline, which would then need to be corrected at some point before the data was used.

Production-quality cache-coherence mechanisms are extremely complicated due to these sorts of considerations.

Quick Quiz [*].4: 
Why is it necessary to flush the cacheline from CPU 7's cache?
 
Answer:
If the cacheline was not flushed from CPU 7's cache, then CPUs 0 and 7 might have different values for the same set of variables in the cacheline. This sort of incoherence would greatly complicate parallel software, and so hardware architects have been convinced to avoid it.

Quick Quiz [*].5: 
Surely the hardware designers could be persuaded to improve this situation! Why have they been content with such abysmal performance for these single-instruction operations?
 
Answer:
The hardware designers have been working on this problem, and have consulted with no less a luminary than the physicist Stephen Hawking. Hawking's observation was that the hardware designers have two basic problems [Gar07]:

  1. the finite speed of light, and
  2. the atomic nature of matter.


Table: Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System

  Operation             Cost (ns)        Ratio
  ---------------------------------------------
  Clock period                0.4          1.0
  "Best-case" CAS            12.2         33.8
  Best-case lock             25.6         71.2
  Single cache miss          12.9         35.8
  CAS cache miss              7.0         19.4
  Off-Core
    Single cache miss        31.2         86.6
    CAS cache miss           31.2         86.5
  Off-Socket
    Single cache miss        92.4        256.7
    CAS cache miss           95.9        266.4
  Comms Fabric            4,500        7,500
  Global Comms      195,000,000  324,000,000


The first problem limits raw speed, and the second limits miniaturization, which in turn limits frequency. And even this sidesteps the power-consumption issue that is currently holding production frequencies to well below 10 GHz.

Nevertheless, some progress is being made, as may be seen by comparing Table [*] with Table [*] on page [*]. Integration of hardware threads in a single core and multiple cores on a die have improved latencies greatly, at least within the confines of a single core or single die. There has been some improvement in overall system latency, but only by about a factor of two. Unfortunately, neither the speed of light nor the atomic nature of matter has changed much in the past few years.

Section [*] looks at what else hardware designers might be able to do to ease the plight of parallel programmers.

Quick Quiz [*].6: 
These numbers are insanely large! How can I possibly get my head around them?
 
Answer:
Get a roll of toilet paper. In the USA, each roll will normally have somewhere around 350-500 sheets. Tear off one sheet to represent a single clock cycle, setting it aside. Now unroll the rest of the roll.

The resulting pile of toilet paper will likely represent a single CAS cache miss.

For the more-expensive inter-system communications latencies, use several rolls (or multiple cases) of toilet paper to represent the communications latency.

Important safety tip: make sure to account for the needs of those you live with when appropriating toilet paper!

Quick Quiz [*].7: 
Given that distributed-systems communication is so horribly expensive, why does anyone bother with them?
 
Answer:
There are a number of reasons:

  1. Shared-memory multiprocessor systems have strict size limits. If you need more than a few thousand CPUs, you have no choice but to use a distributed system.
  2. Extremely large shared-memory systems tend to be quite expensive and to have even longer cache-miss latencies than does the small four-CPU system shown in Table [*].
  3. The distributed-systems communications latencies do not necessarily consume the CPU, which can often allow computation to proceed in parallel with message transfer.
  4. Many important problems are "embarrassingly parallel", so that extremely large quantities of processing may be enabled by a very small number of messages. SETI@HOME [aCB08] is but one example of such an application. These sorts of applications can make good use of networks of computers despite extremely long communications latencies.

It is likely that continued work on parallel applications will increase the number of embarrassingly parallel applications that can run well on machines and/or clusters having long communications latencies. That said, greatly reduced hardware latencies would be an extremely welcome development.

Paul E. McKenney 2011-12-16