Contents
3. Introduction
3.1 Historic Parallel Programming Difficulties
3.2 Parallel Programming Goals
3.2.1 Performance
3.2.2 Productivity
3.2.3 Generality
3.3 Alternatives to Parallel Programming
3.3.1 Multiple Instances of a Sequential Application
3.3.2 Make Use of Existing Parallel Software
3.3.3 Performance Optimization
3.4 What Makes Parallel Programming Hard?
3.4.1 Work Partitioning
3.4.2 Parallel Access Control
3.4.3 Resource Partitioning and Replication
3.4.4 Interacting With Hardware
3.4.5 Composite Capabilities
3.4.6 How Do Languages and Environments Assist With These Tasks?
3.5 Guide to This Book
3.5.1 Quick Quizzes
3.5.2 Sample Source Code
4. Hardware and its Habits
4.1 Overview
4.1.1 Pipelined CPUs
4.1.2 Memory References
4.1.3 Atomic Operations
4.1.4 Memory Barriers
4.1.5 Cache Misses
4.1.6 I/O Operations
4.2 Overheads
4.2.1 Hardware System Architecture
4.2.2 Costs of Operations
4.3 Hardware Free Lunch?
4.3.1 3D Integration
4.3.2 Novel Materials and Processes
4.3.3 Special-Purpose Accelerators
4.3.4 Existing Parallel Software
4.4 Software Design Implications
5. Tools of the Trade
5.1 Scripting Languages
5.2 POSIX Multiprocessing
5.2.1 POSIX Process Creation and Destruction
5.2.2 POSIX Thread Creation and Destruction
5.2.3 POSIX Locking
5.2.4 POSIX Reader-Writer Locking
5.3 Atomic Operations
5.4 Linux-Kernel Equivalents to POSIX Operations
5.5 The Right Tool for the Job: How to Choose?
6. Counting
6.1 Why Isn't Concurrent Counting Trivial?
6.2 Statistical Counters
6.2.1 Design
6.2.2 Array-Based Implementation
6.2.3 Eventually Consistent Implementation
6.2.4 Per-Thread-Variable-Based Implementation
6.2.5 Discussion
6.3 Approximate Limit Counters
6.3.1 Design
6.3.2 Simple Limit Counter Implementation
6.3.3 Simple Limit Counter Discussion
6.3.4 Approximate Limit Counter Implementation
6.3.5 Approximate Limit Counter Discussion
6.4 Exact Limit Counters
6.4.1 Atomic Limit Counter Implementation
6.4.2 Atomic Limit Counter Discussion
6.4.3 Signal-Theft Limit Counter Design
6.4.4 Signal-Theft Limit Counter Implementation
6.4.5 Signal-Theft Limit Counter Discussion
6.5 Applying Specialized Parallel Counters
6.6 Parallel Counting Discussion
7. Partitioning and Synchronization Design
7.1 Partitioning Exercises
7.1.1 Dining Philosophers Problem
7.1.2 Double-Ended Queue
7.1.3 Partitioning Example Discussion
7.2 Design Criteria
7.3 Synchronization Granularity
7.3.1 Sequential Program
7.3.2 Code Locking
7.3.3 Data Locking
7.3.4 Data Ownership
7.3.5 Locking Granularity and Performance
7.4 Parallel Fastpath
7.4.1 Reader/Writer Locking
7.4.2 Hierarchical Locking
7.4.3 Resource Allocator Caches
7.5 Performance Summary
8. Locking
8.1 Staying Alive
8.1.1 Deadlock
8.1.2 Livelock
8.1.3 Unfairness
8.1.4 Inefficiency
8.2 Types of Locks
8.2.1 Exclusive Locks
8.2.2 Reader-Writer Locks
8.2.3 Beyond Reader-Writer Locks
8.3 Lock-Based Existence Guarantees
9. Data Ownership
10. Deferred Processing
10.1 Barriers
10.2 Reference Counting
10.2.1 Implementation of Reference-Counting Categories
10.2.2 Linux Primitives Supporting Reference Counting
10.2.3 Counter Optimizations
10.3 Read-Copy Update (RCU)
10.3.1 RCU Fundamentals
10.3.2 RCU Usage
10.3.3 RCU Linux-Kernel API
10.3.4 "Toy" RCU Implementations
10.3.5 RCU Exercises
11. Applying RCU
11.1 RCU and Per-Thread-Variable-Based Statistical Counters
11.1.1 Design
11.1.2 Implementation
11.1.3 Discussion
11.2 RCU and Counters for Removable I/O Devices
12. Validation: Debugging and Analysis
12.1 Tracing
12.2 Assertions
12.3 Static Analysis
12.4 Probability and Heisenbugs
12.5 Profiling
12.6 Differential Profiling
12.7 Performance Estimation
13. Data Structures
13.1 Lists
13.2 Computational Complexity and Performance
13.3 Design Tradeoffs
13.4 Protection
13.5 Bits and Bytes
13.6 Hardware Considerations
14. Advanced Synchronization
14.1 Avoiding Locks
14.2 Memory Barriers
14.2.1 Memory Ordering and Memory Barriers
14.2.2 If B Follows A, and C Follows B, Why Doesn't C Follow A?
14.2.3 Variables Can Have More Than One Value
14.2.4 What Can You Trust?
14.2.5 Review of Locking Implementations
14.2.6 A Few Simple Rules
14.2.7 Abstract Memory Access Model
14.2.8 Device Operations
14.2.9 Guarantees
14.2.10 What Are Memory Barriers?
14.2.11 Locking Constraints
14.2.12 Memory-Barrier Examples
14.2.13 The Effects of the CPU Cache
14.2.14 Where Are Memory Barriers Needed?
14.3 Non-Blocking Synchronization
14.3.1 Simple NBS
14.3.2 Hazard Pointers
14.3.3 Atomic Data Structures
14.3.4 "Macho" NBS
15. Ease of Use
15.1 Rusty Scale for API Design
15.2 Shaving the Mandelbrot Set
16. Time Management
17. Conflicting Visions of the Future
17.1 Transactional Memory
17.1.1 I/O Operations
17.1.2 RPC Operations
17.1.3 Memory-Mapping Operations
17.1.4 Multithreaded Transactions
17.1.5 Extra-Transactional Accesses
17.1.6 Time Delays
17.1.7 Locking
17.1.8 Reader-Writer Locking
17.1.9 Persistence
17.1.10 Dynamic Linking and Loading
17.1.11 Debugging
17.1.12 The exec() System Call
17.1.13 RCU
17.1.14 Discussion
17.2 Shared-Memory Parallel Functional Programming
17.3 Process-Based Parallel Functional Programming
A. Important Questions
A.1 What Does "After" Mean?
B. Synchronization Primitives
B.1 Organization and Initialization
B.1.1 smp_init()
B.2 Thread Creation, Destruction, and Control
B.2.1 create_thread()
B.2.2 smp_thread_id()
B.2.3 for_each_thread()
B.2.4 for_each_running_thread()
B.2.5 wait_thread()
B.2.6 wait_all_threads()
B.2.7 Example Usage
B.3 Locking
B.3.1 spin_lock_init()
B.3.2 spin_lock()
B.3.3 spin_trylock()
B.3.4 spin_unlock()
B.3.5 Example Usage
B.4 Per-Thread Variables
B.4.1 DEFINE_PER_THREAD()
B.4.2 DECLARE_PER_THREAD()
B.4.3 per_thread()
B.4.4 __get_thread_var()
B.4.5 init_per_thread()
B.4.6 Usage Example
B.5 Performance
C. Why Memory Barriers?
C.1 Cache Structure
C.2 Cache-Coherence Protocols
C.2.1 MESI States
C.2.2 MESI Protocol Messages
C.2.3 MESI State Diagram
C.2.4 MESI Protocol Example
C.3 Stores Result in Unnecessary Stalls
C.3.1 Store Buffers
C.3.2 Store Forwarding
C.3.3 Store Buffers and Memory Barriers
C.4 Store Sequences Result in Unnecessary Stalls
C.4.1 Invalidate Queues
C.4.2 Invalidate Queues and Invalidate Acknowledge
C.4.3 Invalidate Queues and Memory Barriers
C.5 Read and Write Memory Barriers
C.6 Example Memory-Barrier Sequences
C.6.1 Ordering-Hostile Architecture
C.6.2 Example 1
C.6.3 Example 2
C.6.4 Example 3
C.7 Memory-Barrier Instructions For Specific CPUs
C.7.1 Alpha
C.7.2 AMD64
C.7.3 ARMv7-A/R
C.7.4 IA64
C.7.5 PA-RISC
C.7.6 POWER / PowerPC
C.7.7 SPARC RMO, PSO, and TSO
C.7.8 x86
C.7.9 zSeries
C.8 Are Memory Barriers Forever?
C.9 Advice to Hardware Designers
D. Read-Copy Update Implementations
D.1 Sleepable RCU Implementation
D.1.1 SRCU Implementation Strategy
D.1.2 SRCU API and Usage
D.1.3 Implementation
D.1.4 SRCU Summary
D.2 Hierarchical RCU Overview
D.2.1 Review of RCU Fundamentals
D.2.2 Brief Overview of Classic RCU Implementation
D.2.3 RCU Desiderata
D.2.4 Towards a More Scalable RCU Implementation
D.2.5 Towards a Greener RCU Implementation
D.2.6 State Machine
D.2.7 Use Cases
D.2.8 Testing
D.2.9 Conclusion
D.3 Hierarchical RCU Code Walkthrough
D.3.1 Data Structures and Kernel Parameters
D.3.2 External Interfaces
D.3.3 Initialization
D.3.4 CPU Hotplug
D.3.5 Miscellaneous Functions
D.3.6 Grace-Period-Detection Functions
D.3.7 Dyntick-Idle Functions
D.3.8 Forcing Quiescent States
D.3.9 CPU-Stall Detection
D.3.10 Possible Flaws and Changes
D.4 Preemptible RCU
D.4.1 Conceptual RCU
D.4.2 Overview of Preemptible RCU Algorithm
D.4.3 Validation of Preemptible RCU
E. Formal Verification
E.1 What are Promela and Spin?
E.2 Promela Example: Non-Atomic Increment
E.3 Promela Example: Atomic Increment
E.3.1 Combinatorial Explosion
E.4 How to Use Promela
E.4.1 Promela Peculiarities
E.4.2 Promela Coding Tricks
E.5 Promela Example: Locking
E.6 Promela Example: QRCU
E.6.1 Running the QRCU Example
E.6.2 How Many Readers and Updaters Are Really Needed?
E.6.3 Alternative Approach: Proof of Correctness
E.6.4 Alternative Approach: More Capable Tools
E.6.5 Alternative Approach: Divide and Conquer
E.7 Promela Parable: dynticks and Preemptible RCU
E.7.1 Introduction to Preemptible RCU and dynticks
E.7.2 Validating Preemptible RCU and dynticks
E.7.3 Lessons (Re)Learned
E.8 Simplicity Avoids Formal Verification
E.8.1 State Variables for Simplified Dynticks Interface
E.8.2 Entering and Leaving Dynticks-Idle Mode
E.8.3 NMIs From Dynticks-Idle Mode
E.8.4 Interrupts From Dynticks-Idle Mode
E.8.5 Checking For Dynticks Quiescent States
E.8.6 Discussion
E.9 Summary
F. Answers to Quick Quizzes
F.1 Chapter
F.2 Chapter
F.3 Chapter
F.4 Chapter
F.5 Chapter
F.6 Chapter
F.7 Chapter
F.8 Chapter
F.9 Chapter
F.10 Chapter
F.11 Chapter
F.12 Chapter
F.13 Chapter
F.14 Chapter
F.15 Chapter
F.16 Chapter
G. Glossary
Bibliography
H. Credits
H.1 Authors
H.2 Reviewers
H.3 Machine Owners
H.4 Original Publications
H.5 Figure Credits
H.6 Other Support