C.7 Memory-Barrier Instructions For Specific CPUs

Each CPU has its own peculiar memory-barrier instructions, which can make portability a challenge, as indicated by Table [*]. In fact, many software environments, including pthreads and Java, simply prohibit direct use of memory barriers, restricting the programmer to mutual-exclusion primitives that incorporate them to the extent that they are required. In the table, the first four columns indicate whether a given CPU allows the four possible combinations of loads and stores to be reordered. The next two columns indicate whether a given CPU allows loads and stores to be reordered with atomic instructions.

The seventh column, data-dependent reads reordered, requires some explanation, which is undertaken in the following section covering Alpha CPUs. The short version is that Alpha requires memory barriers for readers as well as updaters of linked data structures. Yes, this does mean that Alpha can in effect fetch the data pointed to before it fetches the pointer itself. Strange, but true. Please see http://www.openvms.compaq.com/wizard/wiz_2637.html if you think that I am just making this up. The benefit of this extremely weak memory model is that Alpha can use simpler cache hardware, which in turn permitted higher clock frequency in Alpha's heyday.
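To make the dependent-load case concrete, here is a minimal userspace sketch that uses C11 atomics rather than any Linux-kernel primitive. The names struct node, publish(), and read_published() are hypothetical and exist only for illustration. The reader's memory_order_consume load marks the spot where Alpha needs a barrier between fetching the pointer and dereferencing it. Note that compilers commonly treat memory_order_consume as memory_order_acquire, so this shows the intent rather than the cheapest possible code.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct node {
	int data;
};

static struct node node_storage;
static _Atomic(struct node *) global_ptr = NULL;

/* Updater: initialize the node, then publish a pointer to it. */
void publish(int value)
{
	node_storage.data = value;
	/* Release ordering keeps the ->data store ahead of the pointer store. */
	atomic_store_explicit(&global_ptr, &node_storage, memory_order_release);
}

/* Reader: fetch the pointer, then dereference it.
 * Returns -1 if nothing has been published yet. */
int read_published(void)
{
	/* The dependent load below is the one Alpha could otherwise
	 * satisfy with stale data. */
	struct node *p = atomic_load_explicit(&global_ptr, memory_order_consume);

	if (p == NULL)
		return -1;
	return p->data;
}
```

Calling read_published() before publish() returns -1. After publish(42), it returns 42.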

The last column indicates whether a given CPU has an incoherent instruction cache and pipeline. Such CPUs require that special instructions be executed for self-modifying code.

Parenthesized CPU names indicate modes that are architecturally allowed, but rarely used in practice.


Table: Summary of Memory Ordering

Columns (Y = yes, - = no):
  (1) Loads Reordered After Loads?
  (2) Loads Reordered After Stores?
  (3) Stores Reordered After Stores?
  (4) Stores Reordered After Loads?
  (5) Atomic Instructions Reordered With Loads?
  (6) Atomic Instructions Reordered With Stores?
  (7) Dependent Loads Reordered?
  (8) Incoherent Instruction Cache/Pipeline?

CPU              (1) (2) (3) (4) (5) (6) (7) (8)
Alpha             Y   Y   Y   Y   Y   Y   Y   Y
AMD64             -   -   -   Y   -   -   -   -
ARMv7-A/R         Y   Y   Y   Y   Y   Y   -   Y
IA64              Y   Y   Y   Y   Y   Y   -   Y
(PA-RISC)         Y   Y   Y   Y   -   -   -   -
PA-RISC CPUs      -   -   -   -   -   -   -   -
POWER™            Y   Y   Y   Y   Y   Y   -   Y
(SPARC RMO)       Y   Y   Y   Y   Y   Y   -   Y
(SPARC PSO)       -   -   Y   Y   -   Y   -   Y
SPARC TSO         -   -   -   Y   -   -   -   Y
x86               -   -   -   Y   -   -   -   Y
(x86 OOStore)     Y   Y   Y   Y   -   -   -   Y
zSeries®          -   -   -   Y   -   -   -   Y


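The common "just say no" approach to memory barriers can be eminently reasonable where it applies, but there are environments, such as the Linux kernel, where direct use of memory barriers is required. Therefore, Linux provides a carefully chosen least-common-denominator set of memory-barrier primitives, which are as follows: smp_mb(), a memory barrier that orders both loads and stores; smp_rmb(), a read memory barrier that orders only loads; smp_wmb(), a write memory barrier that orders only stores; and smp_read_barrier_depends(), which orders subsequent operations that depend on prior operations and is a no-op on all platforms except Alpha.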

The smp_mb(), smp_rmb(), and smp_wmb() primitives also force the compiler to eschew any optimizations that would have the effect of reordering memory operations across the barriers. The smp_read_barrier_depends() primitive has a similar effect, but only on Alpha CPUs. See Section [*] for more information on use of these primitives.
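The following userspace sketch shows the usual pairing of a write barrier in the producer with a read barrier in the consumer. The macro definitions are stand-ins built from C11 fences and are an assumption made so that the example compiles outside the kernel; they are not the kernel's definitions. C11 has no store-only or load-only fence, so the release and acquire fences used here are somewhat stronger than strictly needed. The names payload, ready, producer(), and consumer() are hypothetical.

```c
#include <stdatomic.h>

/* Userspace stand-ins (assumptions, NOT the kernel's definitions). */
#define smp_mb()  atomic_thread_fence(memory_order_seq_cst)
#define smp_rmb() atomic_thread_fence(memory_order_acquire)
#define smp_wmb() atomic_thread_fence(memory_order_release)

static int payload;       /* ordinary data, guarded by the flag */
static atomic_int ready;  /* zero until payload is valid */

/* Producer: write the data, then set the flag. */
void producer(int value)
{
	payload = value;
	smp_wmb();	/* order the payload store before the flag store */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and fills *out if data is ready, else returns 0. */
int consumer(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_relaxed))
		return 0;
	smp_rmb();	/* order the flag load before the payload load */
	*out = payload;
	return 1;
}
```

Without the barrier pair, a weakly ordered CPU (or the compiler) could let the consumer see the flag set yet still read the old payload.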

These primitives generate code only in SMP kernels; however, each also has a UP version (mb(), rmb(), wmb(), and read_barrier_depends(), respectively) that generates a memory barrier even in UP kernels. The smp_ versions should be used in most cases. However, the UP versions are useful when writing drivers, because MMIO accesses must remain ordered even in UP kernels. In the absence of memory-barrier instructions, both CPUs and compilers would happily rearrange these accesses, which at best would make the device act strangely, and could crash your kernel or, in some cases, even damage your hardware.
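As a rough illustration of the driver case, the sketch below writes a data register and then rings a doorbell, with a write barrier in between. Everything here is an assumption made for the sake of a compilable userspace example: struct fake_dev_regs and dev_send() are hypothetical, ordinary memory stands in for real MMIO registers, and wmb() is defined locally from a C11 fence rather than taken from the kernel. A real driver would use the kernel's own barrier and MMIO accessor primitives.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's wmb() (an assumption). */
#define wmb() atomic_thread_fence(memory_order_seq_cst)

/* Hypothetical register layout; in a real driver this would be MMIO. */
struct fake_dev_regs {
	volatile uint32_t data;      /* value the device should consume */
	volatile uint32_t doorbell;  /* writing 1 tells the device to go */
};

/* Hand one value to the (simulated) device. */
void dev_send(struct fake_dev_regs *regs, uint32_t value)
{
	regs->data = value;
	wmb();			/* data must be visible before the doorbell */
	regs->doorbell = 1;
}
```

The ordering requirement is the same whether or not the kernel is built for SMP, because the observer is the device rather than another CPU.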

So most kernel programmers need not worry about the memory-barrier peculiarities of each and every CPU, as long as they stick to these interfaces. If you are working deep in a given CPU's architecture-specific code, of course, all bets are off.

Furthermore, all of Linux's locking primitives (spinlocks, reader-writer locks, semaphores, RCU, ...) include any needed barrier primitives. So if you are working with code that uses these primitives, you don't even need to worry about Linux's memory-ordering primitives.
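To see why lock-based code can ignore explicit barriers, consider this toy spinlock built from a C11 atomic_flag. It is a sketch under stated assumptions and not Linux's spinlock implementation; toy_spin_lock(), toy_spin_unlock(), and counter_inc() are hypothetical names. The acquire ordering on lock and the release ordering on unlock are the barriers, so the critical section itself contains none.

```c
#include <stdatomic.h>

static atomic_flag counter_lock = ATOMIC_FLAG_INIT;
static long counter;	/* protected by counter_lock */

/* Spin until the flag is ours; acquire ordering keeps the critical
 * section from being reordered before the lock is taken. */
void toy_spin_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin */
}

/* Release ordering keeps the critical section from leaking past
 * the unlock. */
void toy_spin_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* The caller's critical section needs no explicit barriers. */
long counter_inc(void)
{
	long v;

	toy_spin_lock(&counter_lock);
	v = ++counter;
	toy_spin_unlock(&counter_lock);
	return v;
}
```

Successive calls to counter_inc() return 1, 2, and so on, and the same code remains correct when called from multiple threads.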

That said, deep knowledge of each CPU's memory-consistency model can be very helpful when debugging, to say nothing of when writing architecture-specific code or synchronization primitives.

Besides, they say that a little knowledge is a very dangerous thing. Just imagine the damage you could do with a lot of knowledge! For those who wish to understand more about individual CPUs' memory-consistency models, the next sections describe those of the most popular and prominent CPUs. Although nothing can replace actually reading a given CPU's documentation, these sections give a good overview.



Paul E. McKenney 2011-12-16