- ... sanity.2.1
-
Or, perhaps more accurately, without much greater risk to your
sanity than that incurred by non-parallel programming.
Which, come to think of it, might not be saying all that much.
Either way, Appendix
discusses
some important questions whose answers are less intuitive in
parallel programs than in sequential program.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... CPUs.3.1
-
This plot shows clock frequencies for newer CPUs theoretically
capable of retiring one or more instructions per clock, and MIPS for
older CPUs requiring multiple clocks to execute even the
simplest instruction.
The reason for taking this approach is that the newer CPUs'
ability to retire multiple instructions per clock is typically
limited by memory-system performance.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... optimizations.3.2
-
Of course, if you are a hobbyist whose primary interest is
writing parallel software, that is more than enough reason to
parallelize whatever software you are interested in.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... functions.7.1
-
One could easily create a polymorphic implementation in any
number of languages, but doing so is left as an exercise for
the reader.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... non-interruptible7.2
-
Either by masking interrupts or by being oblivious to them.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
interrelationships.7.3
-
A real-world parallel system will be subject to many additional
design criteria, such as data-structure layout,
memory size, memory-hierarchy latencies, and bandwidth limitations.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... parallelism.7.4
-
This plot shows clock frequencies for newer CPUs theoretically
capable of retiring one or more instructions per clock, and MIPS for
older CPUs requiring multiple clocks to execute even the
simplest instruction.
The reason for taking this approach is that the newer CPUs'
ability to retire multiple instructions per clock is typically
limited by memory-system performance.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... point.7.5
-
The examples in this section are taken from Hart et
al. [HMB06], adapted for clarity
by gathering code related code from multiple files.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... locks.7.6
-
If your program instead has locks in data structures,
or, in the case of Java, uses classes with synchronized
instances, you are instead using ``data locking'', described
in Section
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... loop.7.7
-
Of course, if there are 8 CPUs, each CPU must wait 175 nanoseconds
for each of the other CPUs to do its increment before consuming
an additional 25 nanoseconds doing its own increment.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... empty.7.8
-
Both pool sizes (TARGET_POOL_SIZE and
GLOBAL_POOL_SIZE) are unrealistically small, but this small
size makes it easier to single-step the program in order to get
a feel for its operation.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... results7.9
-
This data was not collected in a statistically meaningful way,
and therefore should be viewed with great skepticism and suspicion.
Good data-collection and -reduction practice is discussed
in Chapter @@@.
That said, repeated runs gave similar results, and these results
match more careful evaluations of similar algorithms.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... mobile.8.1
-
And, in contrast to the 1900s, mobility is the common case.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... quickly).10.1
-
Thanks to James Bottomley for urging me to this
formulation, as opposed to simply saying that
there are no forward-progress guarantees.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... properties.14.1
-
Or, better yet, you can avoid explicit use of memory barriers
entirely.
But that would be the subject of other sections.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... order.14.2
-
Of course, this order might be different from one run
to the next.
On any given run, however, all CPUs and threads must
have a consistent view of the order of critical sections
for a given exclusive lock.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... consistent.14.3
-
A given CPU's series may of course be incomplete,
for example, if a given CPU never loaded or stored
the shared variable, then it can have no opinion about
that variable's value.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... B,14.4
-
For example, by executing the store to A, a
memory barrier, and then the store to B.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... A,14.5
-
For example, by executing the load from B, a
memory barrier, and then the load from A.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... persists.14.6
-
Or, for the more competitively oriented, the first
CPU's store to B ``wins''.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... hardware14.7
-
This is of concern primarily in operating-system kernels.
For more information on hardware operations and memory
ordering, see the files pci.txt, DMA-mapping.txt,
and DMA-API.txt in the Documentation directory in
the Linux source tree [Tor03].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... type.14.8
-
By ``weaker'', we mean "makes fewer ordering guarantees".
A weaker barrier is usually also lower-overhead than is a
stronger barrier.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... caches,14.9
-
But note that in ``superscalar'' systems, the CPU
might well be accessing both halves of its cache at
once, and might in fact be performing multiple concurrent
accesses to each of the halves.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... set'',15.1
-
Due to Josh Triplett.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... frequently.15.2
-
Those of you with strong operating-system backgrounds, please
suspend disbelief.
If you are unable to suspend disbelief, send us a better example.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... operations.17.1
-
This difficulty was pointed out by Michael Factor.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...
implications.17.2
-
This difference between mapping and unmapping was noted by
Josh Triplett.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... attributes.17.3
-
Thanks to Mark Moir for pointing me at this spec, and
to Michael Wong for having pointed me at an earlier
revision some time back.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... implementations.17.4
-
Kudos to the TxLinux group, Maged Michael, and Josh Triplett
for coming up with a number of the above alternatives.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... control.B.1
-
There are many other names for similar software constructs, including
``process'', ``task'', ``fiber'', ``event'', and so on.
Similar design principles apply to all of them.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... state,B.2
-
How is that for a circular definition?
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... cycles.C.1
-
It is standard practice to use multiple levels of cache,
with a small level-one cache close to the CPU with
single-cycle access time, and a larger level-two cache
with a longer access time, perhaps roughly ten clock cycles.
Higher-performance CPUs often have three or even four levels
of cache.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... states,C.2
-
See Culler et al. [CSG99] pages 670 and 671
for the nine-state and 26-state diagrams for SGI Origin2000
and Sequent (now IBM) NUMA-Q, respectively.
Both diagrams are significantly simpler than real life.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... time.C.3
-
The time required to transfer a cache line from one CPU's cache
to another's is typically a few orders of magnitude more than
that required to execute a simple register-to-register instruction.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... architecture.C.4
-
Readers preferring a detailed look at real hardware
architectures are encouraged to consult CPU vendors'
manuals [SW95,Adv02,Int02b,IBM94,LSH02,SPA94,Int04b,Int04a,Int04c],
Gharachorloo's dissertation [Gha95], or
Peter Sewell's work [Sew].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... see.C.5
-
Any real hardware architect or designer will no doubt be
loudly calling for Ralph on the porcelain intercom,
as they just might be just a bit upset about the prospect of working
out which queue should handle a message involving a cache line
that both CPUs accessed, to say nothing of the many races that
this example poses.
All I can say is ``Give me a better example''.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... up.C.6
-
Of course, the astute reader will have already recognized that
Alpha is nowhere near as mean and nasty as it could be,
the (thankfully) mythical architecture in
Section
being a case in point.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...synchronize_srcu().D.1
-
For example, an SRCU-protected hash table might have a lock
per hash chain, thus allowing at most one block per hash
chain to be waiting for synchronize_srcu().
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... once,D.2
-
Please note that, despite the name, barrier()
has absolutely no effect on the CPU's ability to
reorder execution of both code and of memory accesses.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... CPU.D.3
-
It is important to note that the smp_processor_id() primitive
has long-term meaning only if preemption is disabled.
In absence of preemption disabling, a potential preemption
immediately following execution of this primitive could
cause the subsequent code to execute on some other CPU.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ...CONFIG_SMP,D.4
-
For non-CONFIG_SMP, force_quiescent_state is a
simple wrapper around set_need_resched().
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... purchased.F.1
-
Yes, this sudden realization did cause him to walk quite
a bit more carefully.
Why do you ask?
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... close.F.2
-
Parallel programming is in some ways more difficult than
sequential programming, for example, parallel validation
is more difficult.
But no longer mind-crushingly difficult.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
- ... setG.1
-
In hardware-cache terminology, the word ``set''
is used in the same way that the word ``bucket''
is used when discussing software caches.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.