In the , I
explained how to modify SWP code to make use of compiler intrinsics.
Using intrinsics hides the underlying detail needed to use the load and
store exclusive instructions and the use of memory barriers. In this article I look at implementing atomic memory accesses in assembler.
In order to describe memory barriers, what they are and how they should be used, I need to describe two types of memory model, strongly ordered and weakly ordered. The strongly-ordered
model is very natural for programmers. In this model, the order that a
program writes data to memory is the order in which the data is observed
being written into memory, that is, other programs sharing the data
will "see" the same ordering regardless of the CPU that they are
executing on.
For example, if a CPU writes a new X then writes a
new Y, all other CPUs that subsequently read Y then a read X, will
access either the new Y and new X, the old Y and the new X, or old Y and
the old X. However, because the order of the write is strongly ordered
as write X first then write Y, n o CPU will access the new Y and the old
X .
Modern CPUs, such as ARM, optimize memory accesses via
hardware mechanisms avoiding memory accesses when it can, for example by
using a cache. When it has to access memory, it makes those memory
accesses as efficient as possible. An example of efficient memory
accesses is the speculative reading of memory into caches before the CPU
asks for it. This could happen when a program is accessing an array in
memory. Whilst the program is working on element N, the memory subsystem
could be reading element N+1 (and above) into cache ready for use.
These mechanisms can cause the memory accesses to be reordered .
Systems
that are allowed to reorder memory accesses in this way are called
weakly ordered memory systems. The upshot of this is that, whilst a
(program running on a) CPU has a coherent view of its own memory
accesses, those memory accesses will be observed differently by
(programs running on) other CPUs. To take my example above, another
CPU might easily access the new Y and the old X.
Unfortunately,
being able to observe the same order of memory accesses is key when
accessing memory atomically. If, in the above example, X is a spinlock
and Y is the data structure being modified within the critical region
protected by X, it is important that the accesses to X are observed by
all other CPUs in the system as being before accesses to Y. What you
need is some mechanism to make sure that all CPUs observe accesses to X
before the accesses to Y and this mechanism is a memory barrier.
Taking load and store exclusives and memory barriers into account, an assembly version of acquiring a spinlock looks like this:
CODE
ENTRY (_spin_lock)
mov r1,#1
1: ldrex r2,[r0]
cmp r2,#0
strexeq r2, r1, [r0]
cmpeq r2, #0
bne 1b
dmb
mov r0,#0
END (_spin_lock)
As
I described in the previous article, a pair of load and store
exclusives are used to acquire the lock. The ‘cmpeq r2, #0’ instruction
checks that the exclusive store to the lock location failed, which it
will do if another CPU has accessed the lock between the LDREX and STREX
operations. A failure will cause the code to retry the load exclusive.
You
will note the Data Memory Barrier (DMB) instruction that is issued once
the lock has been acquired. The DMB guarantees that all memory accesses
before the memory barrier will be observed by all of the other CPUs in
the system before all memory accesses made after the memory barrier.
This makes more sense if you consider that once a lock has been
acquired, a program will then access the data structure(s) locked by the
lock. The DMB in the lock function above ensures that accesses to the
locked data structure are observed after accesses to the lock.
The unlock code looks like:
CODE
ENTRY (_spin_unlock)
mov r1, #0
dmb
str r1, [r0]
END (_spin_unlock)
The
write to unlock the spinlock uses a simple STR instruction (not a store
exclusive). This is fine if there is only one owner of the lock, the
most common case. If a critical region allows more than one CPU to
access it, load and store exclusives would need to be used.
The DMB instruction ensures that accesses to the locked data structure(s) are observed before the lock is unlocked .