ATOMIC_LOADSTORE(9)       Kernel Developer's Manual       ATOMIC_LOADSTORE(9)
NAME
atomic_load_relaxed, atomic_load_acquire, atomic_load_consume,
atomic_store_relaxed, atomic_store_release — atomic and ordered memory
operations
SYNOPSIS
#include <sys/atomic.h>

T
atomic_load_relaxed(const volatile T *p);

T
atomic_load_acquire(const volatile T *p);

T
atomic_load_consume(const volatile T *p);

void
atomic_store_relaxed(volatile T *p, T v);

void
atomic_store_release(volatile T *p, T v);
DESCRIPTION
These type-generic macros implement memory operations that are atomic and
that have memory ordering constraints.  Aside from atomicity and ordering,
the load operations are equivalent to *p and the store operations are
equivalent to *p = v.  The pointer p must be aligned, even on
architectures like x86 which generally lack strict alignment requirements;
see SIZE AND ALIGNMENT for details.
Atomic means that the memory operations cannot be fused or torn:

•  Fused means combined with another memory operation on the same object,
or elided altogether, by the compiler.  For ordinary memory operations,
the compiler may treat

*p = v;
x = *p;

as if it were

*p = v;
x = v;

because it is normally allowed to assume that *p will yield v after
*p = v.  For atomic memory operations, the implementation will not assume
that.  For example,

atomic_store_relaxed(&flag, 1);
while (atomic_load_relaxed(&flag))
        continue;

may be used to set a flag and then busy-wait until another thread clears
it, whereas

flag = 1;
while (flag)
        continue;

may be transformed into the infinite loop

flag = 1;
while (1)
        continue;
•  Torn means split by the compiler into several smaller memory
operations, which other interrupts, threads, or CPUs may witness partially
completed.  For example, if a 32-bit word w is written with

atomic_store_relaxed(&w, 0x00010002);

then an interrupt, other thread, or other CPU reading it with
atomic_load_relaxed(&w) will never witness it partially written, whereas

w = 0x00010002;

might be compiled into a pair of separate 16-bit store instructions
instead of one single word-sized store instruction, in which case other
threads may see the intermediate state with only one of the halves
written.
Atomic operations on any single object occur in a total order shared by all interrupts, threads, and CPUs, which is consistent with the program order in every interrupt, thread, and CPU. A single program without interruption or other threads or CPUs will always observe its own loads and stores in program order, but another program in an interrupt handler, in another thread, or on another CPU may issue loads that return values as if the first program's stores occurred out of program order, and vice versa. Two different threads might each observe a third thread's memory operations in different orders.
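For example, consider two threads sharing two hypothetical flags a and b,
both initially zero (an illustrative sketch, not one of the examples from
this manual):

int a, b;       /* both initially zero */

/* Thread A */
atomic_store_relaxed(&a, 1);
atomic_store_relaxed(&b, 1);

/* Thread B */
int y = atomic_load_relaxed(&b);
int x = atomic_load_relaxed(&a);

Thread B may observe y == 1 and x == 0, as if thread A's stores had
happened out of program order: relaxed atomic operations rule out fusing
and tearing, but promise nothing about the relative order of operations on
different objects.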
The memory ordering constraints make limited guarantees of ordering
relative to memory operations on other objects as witnessed by interrupts,
other threads, or other CPUs, and have the following meanings:

relaxed
No ordering relative to memory operations on any other objects is
guaranteed.  Relaxed ordering is the default for ordinary memory
operations like *p and *p = v.

Atomic operations with relaxed ordering are cheap: they are not
read/modify/write atomic operations, and they do not involve any kind of
inter-CPU ordering barriers.
acquire
This is a constraint on loads only.  Memory operations after the load in
program order will not be reordered to happen before it.  For example, the
implementation is allowed to treat

int x = *p;
if (atomic_load_acquire(q)) {
        int y = *r;
        *s = x + y;
        return 1;
}

as if it were

if (atomic_load_acquire(q)) {
        int x = *p;
        int y = *r;
        *s = x + y;
        return 1;
}

but not as if it were

int x = *p;
int y = *r;
*s = x + y;
if (atomic_load_acquire(q)) {
        return 1;
}
consume
This is a constraint on loads only, like acquire, but it restricts
ordering only for subsequent memory operations on objects at addresses
computed from the value loaded.  For example, the implementation is
allowed to treat

struct foo *foo0, *foo1;

struct foo *f0 = atomic_load_consume(&foo0);
struct foo *f1 = atomic_load_consume(&foo1);
int x = f0->x;
int y = f1->y;

as if it were

struct foo *foo0, *foo1;

struct foo *f1 = atomic_load_consume(&foo1);
struct foo *f0 = atomic_load_consume(&foo0);
int y = f1->y;
int x = f0->x;

but loading f0->x is guaranteed to happen after loading foo0, even if the
CPU had a cached value for the address that f0->x happened to be at, and
likewise for f1->y and foo1.

atomic_load_consume() functions like atomic_load_acquire() as long as the
memory operations that must happen after it are limited to addresses that
depend on the value returned by it, but it is almost always as cheap as
atomic_load_relaxed().  See ACQUIRE OR CONSUME? below for more details.
release
This is a constraint on stores only.  Memory operations before the store
in program order will not be reordered to happen after it.  For example,
the implementation is allowed to treat

int x = *p;
*q = x;
atomic_store_release(r, 0);
int y = *s;
return x + y;

as if it were

int y = *s;
int x = *p;
*q = x;
atomic_store_release(r, 0);
return x + y;

but not as if it were

atomic_store_release(r, 0);
int x = *p;
int y = *s;
*q = x;
return x + y;

In general, each atomic_store_release() must be paired with either
atomic_load_acquire() or atomic_load_consume() in order to have an effect
— it is only when a release operation synchronizes with an acquire or
consume operation that any ordering is guaranteed between memory
operations before the release operation and memory operations after the
acquire/consume operation.
For example, to set up an entry in a table and then mark the entry ready,
you should:

1.  Perform memory operations to initialize the data.

tab[i].x = ...;
tab[i].y = ...;

2.  Use atomic_store_release() to mark it ready.

atomic_store_release(&tab[i].ready, 1);

3.  Possibly in another thread, use atomic_load_acquire() to ascertain
whether it is ready.

if (atomic_load_acquire(&tab[i].ready) == 0)
        return EWOULDBLOCK;

4.  Perform memory operations to use the data.

do_stuff(tab[i].x, tab[i].y);
Similarly, if you want to create an object, initialize it, and then
publish it to be used by another thread, then you should:

1.  Initialize the object.

struct mumble *m = kmem_alloc(sizeof(*m), KM_SLEEP);
m->x = x;
m->y = y;
m->z = m->x + m->y;

2.  Use atomic_store_release() to publish it.

atomic_store_release(&the_mumble, m);

3.  In the thread using it, use atomic_load_consume() to get it.

struct mumble *m = atomic_load_consume(&the_mumble);

4.  Use the object.

m->y &= m->x;
do_things(m->x, m->y, m->z);
In both examples, assuming that the
value written by
atomic_store_release()
in step 2 is read by atomic_load_acquire() or
atomic_load_consume() in step 3, this
guarantees that all of the memory operations in step 1 complete
before any of the memory operations in step 4 — even if they
happen on different CPUs.
Without both the release operation in
step 2 and the
acquire or consume operation in step 3, no ordering is guaranteed
between the memory operations in steps 1 and 4. In fact,
without both release and acquire/consume, even the
assignment
m->z = m->x + m->y
in step 1 might read values of m->x and
m->y that were written in step 4.
ACQUIRE OR CONSUME?
You must use atomic_load_acquire() when subsequent memory operations in
program order that must happen after the load are on objects at addresses
that might not depend arithmetically on the resulting value.  This applies
particularly when the choice of whether to do the subsequent memory
operation depends on a control-flow decision based on the resulting value:
struct gadget {
int ready, x;
} the_gadget;
/* Producer */
the_gadget.x = 42;
atomic_store_release(&the_gadget.ready, 1);
/* Consumer */
if (atomic_load_acquire(&the_gadget.ready) == 0)
return EWOULDBLOCK;
int x = the_gadget.x;
Here the decision of whether to load the_gadget.x depends on a
control-flow decision based on the value loaded from the_gadget.ready,
and loading the_gadget.x must happen after loading the_gadget.ready.
Using
atomic_load_acquire()
guarantees that the compiler and CPU do not conspire to load
the_gadget.x before we have ascertained that it is
ready.
You may use
atomic_load_consume()
if all subsequent memory operations in program order that must happen after
the load are performed on objects at addresses computed
arithmetically from the resulting value, such as loading a pointer to a
structure object and then dereferencing it:
struct gizmo {
int x, y, z;
};
struct gizmo null_gizmo;
struct gizmo *the_gizmo = &null_gizmo;
/* Producer */
struct gizmo *g = kmem_alloc(sizeof(*g), KM_SLEEP);
g->x = 12;
g->y = 34;
g->z = 56;
atomic_store_release(&the_gizmo, g);
/* Consumer */
struct gizmo *g = atomic_load_consume(&the_gizmo);
int y = g->y;
Here the
address of
g->y depends on the value of the pointer loaded
from the_gizmo. Using
atomic_load_consume()
guarantees that we do not witness a stale cache for that address.
In some cases it may be unclear. For example:
int x[2];
bool b;

/* Producer */
x[0] = 42;
atomic_store_release(&b, 0);

/* Consumer 1 */
int y = atomic_load_???(&b) ? x[0] : x[1];

/* Consumer 2 */
int y = x[atomic_load_???(&b) ? 0 : 1];

/* Consumer 3 */
int y = x[atomic_load_???(&b) ^ 1];
Although the three consumers seem to be
equivalent, by the letter of C11 consumers 1 and 2 require
atomic_load_acquire()
because the value determines the address of a subsequent load only via
control-flow decisions in the ?: operator, whereas
consumer 3 can use atomic_load_consume().
However, if you're not sure, you should err on the side of
atomic_load_acquire() until C11 implementations have
ironed out the kinks in the semantics.
On all CPUs other than DEC Alpha,
atomic_load_consume()
is cheap — it is identical to
atomic_load_relaxed(). In contrast,
atomic_load_acquire() usually implies an expensive
memory barrier.
SIZE AND ALIGNMENT
The pointer p must be aligned — that is, if the object it points to is
2^n bytes long, then the low-order n bits of p must be zero.
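Expressed as a check, the rule reads as follows (an illustrative sketch
only, not something these macros require you to write):

/*
 * Illustrative only: for an object of size 2^n bytes, "aligned" means
 * the low-order n bits of its address are zero.
 */
KASSERT(((uintptr_t)p & (sizeof(*p) - 1)) == 0);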
All NetBSD ports support cheap atomic
loads and stores on units of data up to 32 bits. Some ports additionally
support cheap atomic loads and stores on 64-bit quantities if
__HAVE_ATOMIC64_LOADSTORE is defined.  The macros may not be used on
quantities larger than the port supports atomically; attempts to use them
on larger quantities should result in a compile-time assertion failure.
For example, as long as you use
atomic_store_*()
to write a 32-bit quantity, you can safely use
atomic_load_relaxed() to optimistically read it
outside a lock, but for a 64-bit quantity it must be conditional on
__HAVE_ATOMIC64_LOADSTORE — otherwise it will
lead to compile-time errors on platforms without 64-bit atomic loads and
stores:
struct foo {
kmutex_t f_lock;
uint32_t f_refcnt;
uint64_t f_ticket;
};
if (atomic_load_relaxed(&foo->f_refcnt) == 0)
return 123;
#ifdef __HAVE_ATOMIC64_LOADSTORE
if (atomic_load_relaxed(&foo->f_ticket) == ticket)
return 123;
#endif
mutex_enter(&foo->f_lock);
if (foo->f_refcnt == 0 || foo->f_ticket == ticket)
ret = 123;
...
#ifdef __HAVE_ATOMIC64_LOADSTORE
atomic_store_relaxed(&foo->f_ticket, foo->f_ticket + 1);
#else
foo->f_ticket++;
#endif
...
mutex_exit(&foo->f_lock);
Some ports support expensive 64-bit atomic read/modify/write
operations, but not cheap 64-bit atomic loads and stores.  For example,
the armv7 instruction set includes the 64-bit ldrexd (load-exclusive) and
strexd (store-exclusive) instructions, which can be used in loops to form
read/modify/write operations that are atomic on 64-bit quantities.  But
the cheap 64-bit ldrd/strd instructions are atomic only in each of their
two 32-bit halves, not as a single 64-bit access.  These ports define
__HAVE_ATOMIC64_OPS but not
__HAVE_ATOMIC64_LOADSTORE, since they do not have
cheaper 64-bit atomic load/store operations than the full atomic
read/modify/write operations.
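On such a port, a 64-bit object can still be read atomically, at the cost
of a full read/modify/write, by using a compare-and-swap from
atomic_ops(3) that leaves the value unchanged.  A sketch, where ctr is a
hypothetical uint64_t:

uint64_t v;

#ifdef __HAVE_ATOMIC64_LOADSTORE
v = atomic_load_relaxed(&ctr);          /* cheap */
#elif defined(__HAVE_ATOMIC64_OPS)
v = atomic_cas_64(&ctr, 0, 0);          /* atomic, but a full r/m/w */
#endif

atomic_cas_64() returns the old value, and stores a new one only when the
comparison succeeds, so passing identical old and new values never
modifies the object.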
C11 COMPATIBILITY
These macros are meant to follow C11 semantics, in terms of
atomic_load_explicit() and atomic_store_explicit() with the appropriate
memory order specifiers, and are meant to make future adoption of the C11
atomic API easier.  Eventually it may be mandatory to use the C11 _Atomic
type qualifier or equivalent for the operands.
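The rough correspondence, assuming the operand were declared _Atomic (a
sketch of the intended semantics, not the actual <sys/atomic.h>
implementation):

#include <stdatomic.h>

void
example(_Atomic int *p, int v)
{
        int x;

        x = atomic_load_explicit(p, memory_order_relaxed);  /* atomic_load_relaxed(p) */
        x = atomic_load_explicit(p, memory_order_acquire);  /* atomic_load_acquire(p) */
        x = atomic_load_explicit(p, memory_order_consume);  /* atomic_load_consume(p) */
        atomic_store_explicit(p, v, memory_order_relaxed);  /* atomic_store_relaxed(p, v) */
        atomic_store_explicit(p, v, memory_order_release);  /* atomic_store_release(p, v) */
        (void)x;
}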
LINUX ANALOGUES
The Linux kernel provides two macros READ_ONCE(x) and WRITE_ONCE(x, v)
which are similar to atomic_load_consume(&x) and
atomic_store_relaxed(&x, v), respectively.  However, while Linux's
READ_ONCE and WRITE_ONCE prevent fusing, they may in some cases be torn —
and therefore fail to guarantee atomicity — because:

•  They do not require the address &x to be aligned.

•  They do not require sizeof(x) to be at most the largest size of
available atomic loads and stores on the host architecture.

MEMORY BARRIERS AND ATOMIC READ/MODIFY/WRITE OPERATIONS
The atomic read/modify/write operations in
atomic_ops(3) have relaxed
ordering by default, but can be combined with the memory barriers in
membar_ops(3) for the same
effect as an acquire operation and a release operation for the purposes of
pairing with
atomic_store_release()
and atomic_load_acquire() or
atomic_load_consume(). If
atomic_r/m/w() is an atomic read/modify/write
operation in
atomic_ops(3), then
membar_release();
atomic_r/m/w(obj, ...);
functions like a release operation on obj, and
atomic_r/m/w(obj, ...);
membar_acquire();

functions like an acquire operation on obj.
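For instance, the classic reference-counting idiom pairs these as follows
(a sketch; obj, its unsigned int refcnt member, and obj_destroy() are
hypothetical):

/* Publish our prior stores before giving up our reference. */
membar_release();
if (atomic_dec_uint_nv(&obj->refcnt) == 0) {
        /* Pair with other threads' membar_release() before destroying. */
        membar_acquire();
        obj_destroy(obj);
}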
On architectures where
__HAVE_ATOMIC_AS_MEMBAR is defined, all the
atomic_ops(3) imply
release and acquire operations, so the
membar_acquire(3) and
membar_release(3) barriers are redundant.
The combination of
atomic_load_relaxed()
and membar_acquire(3)
in that order is equivalent to
atomic_load_acquire(), and the combination of
membar_release(3) and
atomic_store_relaxed()
in that order is equivalent to
atomic_store_release().
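In code, the two equivalences read:

/* x = atomic_load_acquire(p); */
x = atomic_load_relaxed(p);
membar_acquire();

/* atomic_store_release(p, v); */
membar_release();
atomic_store_relaxed(p, v);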
EXAMPLES
Maintaining lossy counters.  These may lose some counts, because the
read/modify/write cycle as a whole is not atomic.  But this guarantees
that the count will increase by at most one each time.  In contrast,
without atomic operations, in principle a write to a 32-bit counter might
be torn into multiple smaller stores, which could appear to happen out of
order from another CPU's perspective, leading to nonsensical counter
readouts.  (For frequent events, consider using per-CPU counters instead
in practice.)
unsigned count;
void
record_event(void)
{
atomic_store_relaxed(&count,
1 + atomic_load_relaxed(&count));
}
unsigned
read_event_count(void)
{
return atomic_load_relaxed(&count);
}
Initialization barrier.
int ready;
struct data d;
void
setup_and_notify(void)
{
setup_data(&d.things);
atomic_store_release(&ready, 1);
}
void
try_if_ready(void)
{
if (atomic_load_acquire(&ready))
do_stuff(d.things);
}
Publishing a pointer to the current snapshot of data. (Caller must
arrange that only one call to take_snapshot()
happens at any given time; generally this should be done in coordination
with pserialize(9) or
similar to enable resource reclamation.)
struct data *current_d;
void
take_snapshot(void)
{
struct data *d = kmem_alloc(sizeof(*d), KM_SLEEP);
d->things = ...;
atomic_store_release(&current_d, d);
}
struct data *
get_snapshot(void)
{
return atomic_load_consume(&current_d);
}
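A reader might then use the snapshot under the pserialize(9) coordination
mentioned above (a sketch; use_data() is hypothetical):

void
inspect_snapshot(void)
{
        struct data *d;
        int s;

        s = pserialize_read_enter();
        d = get_snapshot();
        use_data(d->things);
        pserialize_read_exit(s);
}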
CODE REFERENCES
sys/sys/atomic.h
HISTORY
These atomic operations first appeared in NetBSD 9.0.
CAVEATS
C11 formally specifies that all subexpressions, except the left operands
of the ‘&&’, ‘||’, ‘?:’, and ‘,’ operators and the kill_dependency()
macro, carry dependencies for which memory_order_consume guarantees
ordering, but most or all implementations to date simply treat
memory_order_consume as memory_order_acquire and do not take advantage of
data dependencies to elide costly memory barriers or load-acquire CPU
instructions.
Instead, we implement
atomic_load_consume() as
atomic_load_relaxed() followed by
membar_datadep_consumer(3),
which is equivalent to
membar_consumer(3) on
DEC Alpha and
__insn_barrier(3)
elsewhere.
BUGS
Some idiot decided to call it tearing, depriving us of the opportunity to
say that atomic operations prevent fusion and fission.
NetBSD 11.0                    February 11, 2022                   NetBSD 11.0