Next:
Legal Statement
Contents
Is Parallel Programming Hard, And, If So, What Can You Do About It?
Edited by:
Paul E. McKenney
Legal Statement
Contents
Preface
3
. Introduction
3
.
1
Historic Parallel Programming Difficulties
3
.
2
Parallel Programming Goals
3
.
2
.
1
Performance
3
.
2
.
2
Productivity
3
.
2
.
3
Generality
3
.
3
Alternatives to Parallel Programming
3
.
3
.
1
Multiple Instances of a Sequential Application
3
.
3
.
2
Make Use of Existing Parallel Software
3
.
3
.
3
Performance Optimization
3
.
4
What Makes Parallel Programming Hard?
3
.
4
.
1
Work Partitioning
3
.
4
.
2
Parallel Access Control
3
.
4
.
3
Resource Partitioning and Replication
3
.
4
.
4
Interacting With Hardware
3
.
4
.
5
Composite Capabilities
3
.
4
.
6
How Do Languages and Environments Assist With These Tasks?
3
.
5
Guide to This Book
3
.
5
.
1
Quick Quizzes
3
.
5
.
2
Sample Source Code
4
. Hardware and its Habits
4
.
1
Overview
4
.
1
.
1
Pipelined CPUs
4
.
1
.
2
Memory References
4
.
1
.
3
Atomic Operations
4
.
1
.
4
Memory Barriers
4
.
1
.
5
Cache Misses
4
.
1
.
6
I/O Operations
4
.
2
Overheads
4
.
2
.
1
Hardware System Architecture
4
.
2
.
2
Costs of Operations
4
.
3
Hardware Free Lunch?
4
.
3
.
1
3D Integration
4
.
3
.
2
Novel Materials and Processes
4
.
3
.
3
Special-Purpose Accelerators
4
.
3
.
4
Existing Parallel Software
4
.
4
Software Design Implications
5
. Tools of the Trade
5
.
1
Scripting Languages
5
.
2
POSIX Multiprocessing
5
.
2
.
1
POSIX Process Creation and Destruction
5
.
2
.
2
POSIX Thread Creation and Destruction
5
.
2
.
3
POSIX Locking
5
.
2
.
4
POSIX Reader-Writer Locking
5
.
3
Atomic Operations
5
.
4
Linux-Kernel Equivalents to POSIX Operations
5
.
5
The Right Tool for the Job: How to Choose?
6
. Counting
6
.
1
Why Isn't Concurrent Counting Trivial?
6
.
2
Statistical Counters
6
.
2
.
1
Design
6
.
2
.
2
Array-Based Implementation
6
.
2
.
3
Eventually Consistent Implementation
6
.
2
.
4
Per-Thread-Variable-Based Implementation
6
.
2
.
5
Discussion
6
.
3
Approximate Limit Counters
6
.
3
.
1
Design
6
.
3
.
2
Simple Limit Counter Implementation
6
.
3
.
3
Simple Limit Counter Discussion
6
.
3
.
4
Approximate Limit Counter Implementation
6
.
3
.
5
Approximate Limit Counter Discussion
6
.
4
Exact Limit Counters
6
.
4
.
1
Atomic Limit Counter Implementation
6
.
4
.
2
Atomic Limit Counter Discussion
6
.
4
.
3
Signal-Theft Limit Counter Design
6
.
4
.
4
Signal-Theft Limit Counter Implementation
6
.
4
.
5
Signal-Theft Limit Counter Discussion
6
.
5
Applying Specialized Parallel Counters
6
.
6
Parallel Counting Discussion
7
. Partitioning and Synchronization Design
7
.
1
Partitioning Exercises
7
.
1
.
1
Dining Philosophers Problem
7
.
1
.
2
Double-Ended Queue
7
.
1
.
2
.
1
Right- and Left-Hand Locks
7
.
1
.
2
.
2
Compound Double-Ended Queue
7
.
1
.
2
.
3
Hashed Double-Ended Queue
7
.
1
.
2
.
4
Compound Double-Ended Queue Revisited
7
.
1
.
2
.
5
Double-Ended Queue Discussion
7
.
1
.
3
Partitioning Example Discussion
7
.
2
Design Criteria
7
.
3
Synchronization Granularity
7
.
3
.
1
Sequential Program
7
.
3
.
2
Code Locking
7
.
3
.
3
Data Locking
7
.
3
.
4
Data Ownership
7
.
3
.
5
Locking Granularity and Performance
7
.
4
Parallel Fastpath
7
.
4
.
1
Reader/Writer Locking
7
.
4
.
2
Hierarchical Locking
7
.
4
.
3
Resource Allocator Caches
7
.
4
.
3
.
1
Parallel Resource Allocation Problem
7
.
4
.
3
.
2
Parallel Fastpath for Resource Allocation
7
.
4
.
3
.
3
Data Structures
7
.
4
.
3
.
4
Allocation Function
7
.
4
.
3
.
5
Free Function
7
.
4
.
3
.
6
Performance
7
.
4
.
3
.
7
Real-World Design
7
.
5
Performance Summary
8
. Locking
8
.
1
Staying Alive
8
.
1
.
1
Deadlock
8
.
1
.
2
Livelock
8
.
1
.
3
Unfairness
8
.
1
.
4
Inefficiency
8
.
2
Types of Locks
8
.
2
.
1
Exclusive Locks
8
.
2
.
2
Reader-Writer Locks
8
.
2
.
3
Beyond Reader-Writer Locks
8
.
3
Lock-Based Existence Guarantees
9
. Data Ownership
10
. Deferred Processing
10
.
1
Barriers
10
.
2
Reference Counting
10
.
2
.
1
Implementation of Reference-Counting Categories
10
.
2
.
1
.
1
Simple Counting
10
.
2
.
1
.
2
Atomic Counting
10
.
2
.
1
.
3
Atomic Counting With Release Memory Barrier
10
.
2
.
1
.
4
Atomic Counting With Check and Release Memory Barrier
10
.
2
.
2
Linux Primitives Supporting Reference Counting
10
.
2
.
3
Counter Optimizations
10
.
3
Read-Copy Update (RCU)
10
.
3
.
1
RCU Fundamentals
10
.
3
.
1
.
1
Publish-Subscribe Mechanism
10
.
3
.
1
.
2
Wait For Pre-Existing RCU Readers to Complete
10
.
3
.
1
.
3
Maintain Multiple Versions of Recently Updated Objects
10
.
3
.
1
.
4
Summary of RCU Fundamentals
10
.
3
.
2
RCU Usage
10
.
3
.
2
.
1
RCU is a Reader-Writer Lock Replacement
10
.
3
.
2
.
2
RCU is a Restricted Reference-Counting Mechanism
10
.
3
.
2
.
3
RCU is a Bulk Reference-Counting Mechanism
10
.
3
.
2
.
4
RCU is a Poor Man's Garbage Collector
10
.
3
.
2
.
5
RCU is a Way of Providing Existence Guarantees
10
.
3
.
2
.
6
RCU is a Way of Providing Type-Safe Memory
10
.
3
.
2
.
7
RCU is a Way of Waiting for Things to Finish
10
.
3
.
2
.
8
RCU Usage Summary
10
.
3
.
3
RCU Linux-Kernel API
10
.
3
.
3
.
1
RCU has a Family of Wait-to-Finish APIs
10
.
3
.
3
.
2
RCU has Publish-Subscribe and Version-Maintenance APIs
10
.
3
.
3
.
3
Where Can RCU's APIs Be Used?
10
.
3
.
3
.
4
So, What
is
RCU Really?
10
.
3
.
4
``Toy'' RCU Implementations
10
.
3
.
4
.
1
Lock-Based RCU
10
.
3
.
4
.
2
Per-Thread Lock-Based RCU
10
.
3
.
4
.
3
Simple Counter-Based RCU
10
.
3
.
4
.
4
Starvation-Free Counter-Based RCU
10
.
3
.
4
.
5
Scalable Counter-Based RCU
10
.
3
.
4
.
6
Scalable Counter-Based RCU With Shared Grace Periods
10
.
3
.
4
.
7
RCU Based on Free-Running Counter
10
.
3
.
4
.
8
Nestable RCU Based on Free-Running Counter
10
.
3
.
4
.
9
RCU Based on Quiescent States
10
.
3
.
4
.
10
Summary of Toy RCU Implementations
10
.
3
.
5
RCU Exercises
11
. Applying RCU
11
.
1
RCU and Per-Thread-Variable-Based Statistical Counters
11
.
1
.
1
Design
11
.
1
.
2
Implementation
11
.
1
.
3
Discussion
11
.
2
RCU and Counters for Removable I/O Devices
12
. Validation: Debugging and Analysis
12
.
1
Tracing
12
.
2
Assertions
12
.
3
Static Analysis
12
.
4
Probability and Heisenbugs
12
.
5
Profiling
12
.
6
Differential Profiling
12
.
7
Performance Estimation
13
. Data Structures
13
.
1
Lists
13
.
2
Computational Complexity and Performance
13
.
3
Design Tradeoffs
13
.
4
Protection
13
.
5
Bits and Bytes
13
.
6
Hardware Considerations
14
. Advanced Synchronization
14
.
1
Avoiding Locks
14
.
2
Memory Barriers
14
.
2
.
1
Memory Ordering and Memory Barriers
14
.
2
.
2
If B Follows A, and C Follows B, Why Doesn't C Follow A?
14
.
2
.
3
Variables Can Have More Than One Value
14
.
2
.
4
What Can You Trust?
14
.
2
.
4
.
1
Self-References Are Ordered
14
.
2
.
4
.
2
Single-Variable Memory Consistency
14
.
2
.
4
.
3
Pair-Wise Memory Barriers
14
.
2
.
4
.
4
Pair-Wise Memory Barriers: Portable Combinations
14
.
2
.
4
.
5
Pair-Wise Memory Barriers: Semi-Portable Combinations
14
.
2
.
4
.
6
Pair-Wise Memory Barriers: Non-Portable Combinations
14
.
2
.
4
.
7
Semantics Sufficient to Implement Locking
14
.
2
.
5
Review of Locking Implementations
14
.
2
.
6
A Few Simple Rules
14
.
2
.
7
Abstract Memory Access Model
14
.
2
.
8
Device Operations
14
.
2
.
9
Guarantees
14
.
2
.
10
What Are Memory Barriers?
14
.
2
.
10
.
1
Explicit Memory Barriers
14
.
2
.
10
.
2
Implicit Memory Barriers
14
.
2
.
10
.
3
What May Not Be Assumed About Memory Barriers?
14
.
2
.
10
.
4
Data Dependency Barriers
14
.
2
.
10
.
5
Control Dependencies
14
.
2
.
10
.
6
SMP Barrier Pairing
14
.
2
.
10
.
7
Examples of Memory Barrier Pairings
14
.
2
.
10
.
8
Read Memory Barriers vs. Load Speculation
14
.
2
.
11
Locking Constraints
14
.
2
.
12
Memory-Barrier Examples
14
.
2
.
12
.
1
Locking Examples
14
.
2
.
13
The Effects of the CPU Cache
14
.
2
.
13
.
1
Cache Coherency
14
.
2
.
14
Where Are Memory Barriers Needed?
14
.
3
Non-Blocking Synchronization
14
.
3
.
1
Simple NBS
14
.
3
.
2
Hazard Pointers
14
.
3
.
3
Atomic Data Structures
14
.
3
.
4
``Macho'' NBS
15
. Ease of Use
15
.
1
Rusty Scale for API Design
15
.
2
Shaving the Mandelbrot Set
16
. Time Management
17
. Conflicting Visions of the Future
17
.
1
Transactional Memory
17
.
1
.
1
I/O Operations
17
.
1
.
2
RPC Operations
17
.
1
.
3
Memory-Mapping Operations
17
.
1
.
4
Multithreaded Transactions
17
.
1
.
5
Extra-Transactional Accesses
17
.
1
.
6
Time Delays
17
.
1
.
7
Locking
17
.
1
.
8
Reader-Writer Locking
17
.
1
.
9
Persistence
17
.
1
.
10
Dynamic Linking and Loading
17
.
1
.
11
Debugging
17
.
1
.
12
The exec() System Call
17
.
1
.
13
RCU
17
.
1
.
14
Discussion
17
.
2
Shared-Memory Parallel Functional Programming
17
.
3
Process-Based Parallel Functional Programming
A. Important Questions
A.
1
What Does ``After'' Mean?
B. Synchronization Primitives
B.
1
Organization and Initialization
B.
1
.
1
smp_init():
B.
2
Thread Creation, Destruction, and Control
B.
2
.
1
create_thread()
B.
2
.
2
smp_thread_id()
B.
2
.
3
for_each_thread()
B.
2
.
4
for_each_running_thread()
B.
2
.
5
wait_thread()
B.
2
.
6
wait_all_threads()
B.
2
.
7
Example Usage
B.
3
Locking
B.
3
.
1
spin_lock_init()
B.
3
.
2
spin_lock()
B.
3
.
3
spin_trylock()
B.
3
.
4
spin_unlock()
B.
3
.
5
Example Usage
B.
4
Per-Thread Variables
B.
4
.
1
DEFINE_PER_THREAD()
B.
4
.
2
DECLARE_PER_THREAD()
B.
4
.
3
per_thread()
B.
4
.
4
__get_thread_var()
B.
4
.
5
init_per_thread()
B.
4
.
6
Usage Example
B.
5
Performance
C. Why Memory Barriers?
C.
1
Cache Structure
C.
2
Cache-Coherence Protocols
C.
2
.
1
MESI States
C.
2
.
2
MESI Protocol Messages
C.
2
.
3
MESI State Diagram
C.
2
.
4
MESI Protocol Example
C.
3
Stores Result in Unnecessary Stalls
C.
3
.
1
Store Buffers
C.
3
.
2
Store Forwarding
C.
3
.
3
Store Buffers and Memory Barriers
C.
4
Store Sequences Result in Unnecessary Stalls
C.
4
.
1
Invalidate Queues
C.
4
.
2
Invalidate Queues and Invalidate Acknowledge
C.
4
.
3
Invalidate Queues and Memory Barriers
C.
5
Read and Write Memory Barriers
C.
6
Example Memory-Barrier Sequences
C.
6
.
1
Ordering-Hostile Architecture
C.
6
.
2
Example 1
C.
6
.
3
Example 2
C.
6
.
4
Example 3
C.
7
Memory-Barrier Instructions For Specific CPUs
C.
7
.
1
Alpha
C.
7
.
2
AMD64
C.
7
.
3
ARMv7-A/R
C.
7
.
4
IA64
C.
7
.
5
PA-RISC
C.
7
.
6
POWER / Power PC
C.
7
.
7
SPARC RMO, PSO, and TSO
C.
7
.
8
x86
C.
7
.
9
zSeries
C.
8
Are Memory Barriers Forever?
C.
9
Advice to Hardware Designers
D. Read-Copy Update Implementations
D.
1
Sleepable RCU Implementation
D.
1
.
1
SRCU Implementation Strategy
D.
1
.
1
.
1
Abolish Asynchronous Grace-Period APIs
D.
1
.
1
.
2
Isolate Grace-Period Detection
D.
1
.
2
SRCU API and Usage
D.
1
.
2
.
1
Initialization and Cleanup
D.
1
.
2
.
2
Read-Side Primitives
D.
1
.
2
.
3
Update-Side Primitives
D.
1
.
2
.
4
Cleaning Up Safely
D.
1
.
3
Implementation
D.
1
.
3
.
1
Data Structures
D.
1
.
3
.
2
Initialization Implementation
D.
1
.
3
.
3
Read-Side Implementation
D.
1
.
3
.
4
Update-Side Implementation
D.
1
.
4
SRCU Summary
D.
2
Hierarchical RCU Overview
D.
2
.
1
Review of RCU Fundamentals
D.
2
.
2
Brief Overview of Classic RCU Implementation
D.
2
.
3
RCU Desiderata
D.
2
.
4
Towards a More Scalable RCU Implementation
D.
2
.
5
Towards a Greener RCU Implementation
D.
2
.
6
State Machine
D.
2
.
7
Use Cases
D.
2
.
7
.
1
Start a New Grace Period
D.
2
.
7
.
2
Pass Through a Quiescent State
D.
2
.
7
.
3
Announce a Quiescent State to RCU
D.
2
.
7
.
4
Enter and Leave Dynticks Idle Mode
D.
2
.
7
.
5
Interrupt from Dynticks Idle Mode
D.
2
.
7
.
6
NMI from Dynticks Idle Mode
D.
2
.
7
.
7
Note That a CPU is in Dynticks Idle Mode
D.
2
.
7
.
8
Offline a CPU
D.
2
.
7
.
9
Online a CPU
D.
2
.
7
.
10
Detect a Too-Long Grace Period
D.
2
.
8
Testing
D.
2
.
9
Conclusion
D.
3
Hierarchical RCU Code Walkthrough
D.
3
.
1
Data Structures and Kernel Parameters
D.
3
.
1
.
1
Tracking Dyntick State
D.
3
.
1
.
2
Nodes in the Hierarchy
D.
3
.
1
.
3
Per-CPU Data
D.
3
.
1
.
4
RCU Global State
D.
3
.
1
.
5
Kernel Parameters
D.
3
.
2
External Interfaces
D.
3
.
2
.
1
Read-Side Critical Sections
D.
3
.
2
.
2
call_rcu()
D.
3
.
2
.
3
rcu_check_callbacks()
D.
3
.
2
.
4
rcu_process_callbacks()
D.
3
.
2
.
5
rcu_needs_cpu()
and
rcu_cpu_notify()
D.
3
.
3
Initialization
D.
3
.
3
.
1
rcu_init_levelspread()
D.
3
.
3
.
2
rcu_init_one()
D.
3
.
3
.
3
__rcu_init()
D.
3
.
4
CPU Hotplug
D.
3
.
4
.
1
rcu_init_percpu_data()
D.
3
.
4
.
2
rcu_online_cpu()
D.
3
.
4
.
3
rcu_offline_cpu()
D.
3
.
5
Miscellaneous Functions
D.
3
.
6
Grace-Period-Detection Functions
D.
3
.
6
.
1
Noting New Grace Periods
D.
3
.
6
.
2
Noting End of Old Grace Periods
D.
3
.
6
.
3
Starting a Grace Period
D.
3
.
6
.
4
Reporting Quiescent States
D.
3
.
7
Dyntick-Idle Functions
D.
3
.
7
.
1
Entering and Exiting Dyntick-Idle Mode
D.
3
.
7
.
2
NMIs from Dyntick-Idle Mode
D.
3
.
7
.
3
Interrupts from Dyntick-Idle Mode
D.
3
.
7
.
4
Checking for Dyntick-Idle Mode
D.
3
.
8
Forcing Quiescent States
D.
3
.
8
.
1
Recording and Recalling Dynticks-Idle Grace Period
D.
3
.
8
.
2
Handling Offline and Holdout CPUs
D.
3
.
8
.
3
Scanning for Holdout CPUs
D.
3
.
8
.
4
Code for
force_quiescent_state()
D.
3
.
9
CPU-Stall Detection
D.
3
.
10
Possible Flaws and Changes
D.
4
Preemptible RCU
D.
4
.
1
Conceptual RCU
D.
4
.
2
Overview of Preemptible RCU Algorithm
D.
4
.
2
.
1
General Approach
D.
4
.
2
.
2
Data Structures
D.
4
.
2
.
3
Grace-Period State Machine
D.
4
.
2
.
4
Read-Side Primitives
D.
4
.
3
Validation of Preemptible RCU
D.
4
.
3
.
1
Testing
D.
4
.
3
.
2
Conceptual Validation
D.
4
.
3
.
3
Formal Validation
E. Formal Verification
E.
1
What are Promela and Spin?
E.
2
Promela Example: Non-Atomic Increment
E.
3
Promela Example: Atomic Increment
E.
3
.
1
Combinatorial Explosion
E.
4
How to Use Promela
E.
4
.
1
Promela Peculiarities
E.
4
.
2
Promela Coding Tricks
E.
5
Promela Example: Locking
E.
6
Promela Example: QRCU
E.
6
.
1
Running the QRCU Example
E.
6
.
2
How Many Readers and Updaters Are Really Needed?
E.
6
.
3
Alternative Approach: Proof of Correctness
E.
6
.
4
Alternative Approach: More Capable Tools
E.
6
.
5
Alternative Approach: Divide and Conquer
E.
7
Promela Parable: dynticks and Preemptible RCU
E.
7
.
1
Introduction to Preemptible RCU and dynticks
E.
7
.
1
.
1
Task Interface
E.
7
.
1
.
2
Interrupt Interface
E.
7
.
1
.
3
Grace-Period Interface
E.
7
.
2
Validating Preemptible RCU and dynticks
E.
7
.
2
.
1
Basic Model
E.
7
.
2
.
2
Validating Safety
E.
7
.
2
.
3
Validating Liveness
E.
7
.
2
.
4
Interrupts
E.
7
.
2
.
5
Validating Interrupt Handlers
E.
7
.
2
.
6
Validating Nested Interrupt Handlers
E.
7
.
2
.
7
Validating NMI Handlers
E.
7
.
3
Lessons (Re)Learned
E.
8
Simplicity Avoids Formal Verification
E.
8
.
1
State Variables for Simplified Dynticks Interface
E.
8
.
2
Entering and Leaving Dynticks-Idle Mode
E.
8
.
3
NMIs From Dynticks-Idle Mode
E.
8
.
4
Interrupts From Dynticks-Idle Mode
E.
8
.
5
Checking For Dynticks Quiescent States
E.
8
.
6
Discussion
E.
9
Summary
F. Answers to Quick Quizzes
F.
1
Chapter
F.
2
Chapter
F.
3
Chapter
F.
4
Chapter
F.
5
Chapter
F.
6
Chapter
F.
7
Chapter
F.
8
Chapter
F.
9
Chapter
F.
10
Chapter
F.
11
Chapter
F.
12
Chapter
F.
13
Chapter
F.
14
Chapter
F.
15
Chapter
F.
16
Chapter
G. Glossary
Bibliography
H. Credits
H.
1
Authors
H.
2
Reviewers
H.
3
Machine Owners
H.
4
Original Publications
H.
5
Figure Credits
H.
6
Other Support
About this document ...
Paul E. McKenney 2011-12-16