The POSIX API provides a reader-writer lock, which is represented by a pthread_rwlock_t. As with pthread_mutex_t, pthread_rwlock_t may be statically initialized via PTHREAD_RWLOCK_INITIALIZER or dynamically initialized via the pthread_rwlock_init() primitive. The pthread_rwlock_rdlock() primitive read-acquires the specified pthread_rwlock_t, the pthread_rwlock_wrlock() primitive write-acquires it, and the pthread_rwlock_unlock() primitive releases it. Only a single thread may write-hold a given pthread_rwlock_t at any given time, but multiple threads may read-hold a given pthread_rwlock_t, at least while there is no thread currently write-holding it.
As you might expect, reader-writer locks are designed for read-mostly situations. In these situations, a reader-writer lock can provide greater scalability than can an exclusive lock because the exclusive lock is by definition limited to a single thread holding the lock at any given time, while the reader-writer lock permits an arbitrarily large number of readers to concurrently hold the lock. However, in practice, we need to know how much additional scalability is provided by reader-writer locks.
The figure (rwlockscale.c) shows one way of measuring reader-writer lock scalability.
Line 1 shows the definition and initialization of the reader-writer lock.
Line 2 shows the holdtime argument controlling the time each thread holds
the reader-writer lock, and line 3 shows the thinktime argument controlling
the time between the release of the lock and its next acquisition.
Line 4 defines the readcounts array, into which each reader thread places
the number of times it acquired the lock, and line 5 defines the
nreadersrunning variable, which determines when all reader threads have
started running.
Lines 7-10 define goflag, which synchronizes the start and the end of the test. This variable is initially set to GOFLAG_INIT, then set to GOFLAG_RUN after all the reader threads have started, and finally set to GOFLAG_STOP to terminate the test run.
Lines 12-41 define reader(), which is the reader thread. Line 18 atomically increments the nreadersrunning variable to indicate that this thread is now running, and lines 19-21 wait for the test to start. The ACCESS_ONCE() primitive forces the compiler to fetch goflag on each pass through the loop; the compiler would otherwise be within its rights to assume that the value of goflag would never change.
The loop spanning lines 22-38 carries out the performance test. Lines 23-26 acquire the lock, lines 27-29 hold the lock for the specified duration (and the barrier() directive prevents the compiler from optimizing the loop out of existence), lines 30-33 release the lock, and lines 34-36 wait for the specified duration before re-acquiring the lock. Line 37 counts this lock acquisition.
Line 38 moves the lock-acquisition count to this thread's element of the readcounts[] array, and line 40 returns, terminating this thread.
The figure shows the results of running this test on a 64-core Power-5 system
with two hardware threads per core, for a total of 128 software-visible
CPUs.
The thinktime parameter was zero for all these tests, and the
holdtime parameter was set to values ranging from one thousand (``1K''
on the graph) to 100 million (``100M'' on the graph).
The actual value plotted is:

	L_N / (N * L_1)		(5.1)

where N is the number of threads, L_N is the total number of lock acquisitions by N threads, and L_1 is the number of lock acquisitions by a single thread, so that a value of 1.0 corresponds to ideal (linear) scalability.
As can be seen in the figure, reader-writer locking scalability is decidedly non-ideal, especially for smaller critical sections. To see why read-acquisition can be so slow, consider that all the acquiring threads must update the pthread_rwlock_t data structure. Therefore, if all 128 executing threads attempt to read-acquire the reader-writer lock concurrently, they must update this underlying pthread_rwlock_t one at a time. One lucky thread might do so almost immediately, but the least-lucky thread must wait for all the other 127 threads to do their updates. This situation will only get worse as you add CPUs.
Quick Quiz 5.15: Isn't comparing against single-CPU throughput a bit harsh? End Quick Quiz
Quick Quiz 5.16: But 1,000 instructions is not a particularly small size for a critical section. What do I do if I need a much smaller critical section, for example, one containing only a few tens of instructions? End Quick Quiz
Quick Quiz 5.17: In the figure, all of the traces other than the 100M trace deviate gently from the ideal line. In contrast, the 100M trace breaks sharply from the ideal line at 64 CPUs. In addition, the spacing between the 100M trace and the 10M trace is much smaller than that between the 10M trace and the 1M trace. Why does the 100M trace behave so differently than the other traces? End Quick Quiz
Quick Quiz 5.18: Power 5 is several years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow? End Quick Quiz
Despite these limitations, reader-writer locking is quite useful in many
cases, for example, when the readers must do high-latency file or network I/O.
There are alternatives, some of which will be presented in later chapters.
Paul E. McKenney 2011-12-16