This is the case in most engineering disciplines. Would you want to use a bridge designed by an engineer who did not understand the properties of the concrete and steel making up that bridge? If not, why would you expect a parallel programmer to be able to develop competent parallel software without at least some understanding of the underlying hardware?
Quick Quiz .2:
What types of machines would allow atomic operations on
multiple data elements?
Answer:
One answer to this question is that it is often possible to
pack multiple elements of data into a single machine word,
which can then be manipulated atomically.
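For example, here is a minimal sketch of this word-packing approach, assuming C11 atomics; the pair_inc_both() function and the choice of two 32-bit counters are illustrative, not from the text:

  #include <stdatomic.h>
  #include <stdint.h>

  /* Two 32-bit counters packed into one 64-bit word: counter "a" in the
   * low half and counter "b" in the high half, so that a single
   * compare-and-swap updates both counters atomically. */
  static _Atomic uint64_t pair;

  static void pair_inc_both(void)
  {
          uint64_t old = atomic_load(&pair);
          uint64_t new;

          do {
                  uint32_t a = (uint32_t)old + 1;         /* low half */
                  uint32_t b = (uint32_t)(old >> 32) + 1; /* high half */

                  new = ((uint64_t)b << 32) | a;
          } while (!atomic_compare_exchange_weak(&pair, &old, new));
  }

Note that atomic_compare_exchange_weak() updates old with the current value on failure, so the loop simply recomputes and retries until the compare-and-swap succeeds.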
A more trendy answer would be machines supporting transactional memory [Lom77]. However, as of 2008, such machines were still research curiosities. The jury is still out on the applicability of transactional memory [MMW07, PW07, RHP+07].
Quick Quiz .3:
This is a simplified sequence of events?
How could it possibly be any more complex?
Answer:
This sequence ignored a number of possible complications,
including:
1. Other CPUs might be concurrently attempting CAS operations involving this same cacheline.
2. The cacheline might have been replicated read-only in several CPUs' caches, in which case it would need to be flushed from all of those caches.
3. CPU 7 might have been operating on the cacheline when the request for it arrived, in which case CPU 7 would need to hold off the request until its own operation completed.
4. CPU 7 might have ejected the cacheline from its cache (for example, to make room for other data), so that by the time the request arrived, the cacheline was resident in memory or in some other CPU's cache.
Production-quality cache-coherence mechanisms are extremely complicated due to these sorts of considerations.
Quick Quiz .4:
Why is it necessary to flush the cacheline from CPU 7's cache?
Answer:
If the cacheline was not flushed from CPU 7's cache, then
CPUs 0 and 7 might have different values for the same set
of variables in the cacheline.
This sort of incoherence would greatly complicate parallel
software, and so hardware architects have been convinced to
avoid it.
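For example, consider this sketch of a common flag-based handoff; the produce() and consume() names are mine, not the text's, and the code assumes C11 atomics:

  #include <stdatomic.h>

  int msg;                 /* the payload */
  _Atomic int msg_ready;   /* the "message is ready" flag */

  void produce(int m)      /* runs on CPU 0 */
  {
          msg = m;
          atomic_store(&msg_ready, 1);
  }

  int consume(void)        /* runs on CPU 7 */
  {
          while (!atomic_load(&msg_ready))
                  continue; /* spin until the flag is set */
          return msg;
  }

If CPU 7 were permitted to keep a stale copy of the cacheline holding msg_ready and msg, consume() might spin forever or return an old value of msg, even though produce() had long since completed on CPU 0. Cache coherence is what rules out this sort of outcome.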
Quick Quiz .5:
Surely the hardware designers could be persuaded to improve
this situation!
Why have they been content with such abysmal performance
for these single-instruction operations?
Answer:
The hardware designers have been working on this
problem, and have consulted with no less a luminary than
the physicist Stephen Hawking.
Hawking's observation was that the hardware designers have
two basic problems [Gar07]:
1. The finite speed of light.
2. The atomic nature of matter.
The first problem limits raw speed, and the second limits miniaturization, which in turn limits frequency. And even this sidesteps the power-consumption issue that is currently holding production frequencies to well below 10 GHz.
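As a rough back-of-the-envelope check (the numbers are mine, not the text's): light in a vacuum covers about (3 x 10^8 m/s) / (5 x 10^9 Hz), or roughly 6 centimeters, per clock cycle at 5 GHz, and electrical signals in silicon propagate at only a fraction of that speed, so a single round trip across a system tens of centimeters in size already costs many cycles before any transistor-switching delays are counted.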
Nevertheless, some progress is being made, as may be seen by comparing Table  with Table  on page .
Integration of hardware threads in a single core and of multiple
cores on a die has improved latencies greatly, at least within the
confines of a single core or single die.
There has been some improvement in overall system latency,
but only by about a factor of two.
Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few years.
Section
looks at what else hardware designers might be
able to do to ease the plight of parallel programmers.
Quick Quiz .6:
These numbers are insanely large!
How can I possibly get my head around them?
Answer:
Get a roll of toilet paper.
In the USA, each roll will normally have somewhere around 350-500
sheets.
Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single CAS cache miss.
For the more-expensive inter-system communications latencies, use several rolls (or multiple cases) of toilet paper to represent the communications latency.
Important safety tip: make sure to account for the needs of those you live with when appropriating toilet paper!
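To put rough numbers on the analogy (the latencies assumed here are mine, not the text's): at one sheet per clock cycle, a 400-sheet roll corresponds to a CAS cache miss costing a few hundred cycles; a 5-microsecond communications-fabric round trip on a 2 GHz CPU is 10,000 cycles, or about 25 rolls; and a millisecond-scale inter-system latency is 2,000,000 cycles, or thousands of rolls.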
Quick Quiz .7:
Given that distributed-systems communication is so horribly
expensive, why does anyone bother with them?
Answer:
There are a number of reasons:
1. Shared-memory multiprocessor systems have strict size limits. If you need more than a few thousand CPUs, you have little choice but to use a distributed system.
2. Large shared-memory systems tend to be far more expensive per CPU than clusters built from commodity hardware.
3. Many important problems are "embarrassingly parallel", so that extremely large quantities of processing can be enabled by a very small number of messages.
It is likely that continued work on parallel applications will increase the number of embarrassingly parallel applications that can run well on machines and/or clusters having long communications latencies. That said, greatly reduced hardware latencies would be an extremely welcome development.
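As a sketch of the embarrassingly parallel case, here is a Monte Carlo estimate of pi using MPI; the program and its parameters are my illustration, not the text's. Each rank performs a great deal of purely local work, and the only communication is a single reduction at the end, so even millisecond-scale latencies are easily amortized:

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[])
  {
          long i, n = 10 * 1000 * 1000, hits = 0, total_hits = 0;
          int rank, nranks;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nranks);
          srand(rank + 1);          /* independent per-rank sequences */
          for (i = 0; i < n; i++) { /* purely local work, no communication */
                  double x = rand() / (RAND_MAX + 1.0);
                  double y = rand() / (RAND_MAX + 1.0);

                  if (x * x + y * y <= 1.0)
                          hits++;
          }
          /* The only communication: one reduction of a single long. */
          MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0,
                     MPI_COMM_WORLD);
          if (rank == 0)
                  printf("pi is approximately %f\n",
                         4.0 * total_hits / ((double)n * nranks));
          MPI_Finalize();
          return 0;
  }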