The Meditative Coder: multithreading

Thursday, June 12, 2014

Order in Structure Assignment Statements

I had what I thought was a pretty simple piece of code, transferring a structure from one thread to another. Without describing it blow-by-blow, I'll present it thus: a writer thread is measuring temperature and pressure of a steam turbine. A reader periodically samples the measurement. There is no need for the reader to get every measurement, it only needs the most-recent at the point that it samples.

DESIGN!

A global data structure:

struct measurement_s {

    volatile uint64_t post_seq;  /* sequence number of measurement */

    volatile  int64_t temperature;

    volatile  int64_t pressure;

    volatile uint64_t pre_seq;   /* sequence number of measurement */

} live;

The algorithm is for the writer to first increment pre_seq, then update the temperature and pressure, then increment the post_seq. The reader reads the post_seq first, then temperature and pressure, then pre_seq. If the reader sees pre_seq != post_seq, it means that a collision has happened, and the reader should re-try. Here's the reader code:

        struct measurement_s sample;

        do {

            sample = live;

        } while (sample.pre_seq != sample.post_seq);

        /* now have valid sample */

BROKEN!

Pretty simple. Easy to understand. And it didn't work right. The test code was written to make it obvious when inconsistent values are read, and the code above generated inconsistent reads. A LOT of inconsistent reads. How can this be??? I used volatile to prevent the optimizer from messing with the order of reads and writes, and it's an x86-64, which has a very predictable memory model, removing the need for most hardware memory barriers. The algorithm is correct! Why is it failing?

DIAGNOSE!

So I did what I've been doing a surprising amount recently: I generated assem language (this time withOUT optimization turned on). Here's the structure assignment:

    leaq    -64(%rbp), %rax

    movq    _live+24(%rip), %rcx

    movq    %rcx, 24(%rax)

    movq    _live+16(%rip), %rcx

    movq    %rcx, 16(%rax)

    movq    _live+8(%rip), %rcx

    movq    %rcx, 8(%rax)

    movq    _live(%rip), %rcx

    movq    %rcx, (%rax)

HEY! It copied the structure in the REVERSE ORDER than I expected! So it copies the pre_seq first, then the temperature and pressure, and post_seq last! No wonder it detected collisions, that order is NOT thread-safe!

HAPPILY EVER AFTER!

So I replaced the structure assign with explicit copies of the fields:

        struct measurement_s sample;

        do {

            /* Don't do "sample = live;", order of field copies is not defined. */

            sample.post_seq = live.post_seq;

            sample.temperature = live.temperature;

            sample.pressure = live.pressure;

            sample.pre_seq = live.pre_seq;

        } while (sample.pre_seq != sample.post_seq);

YAY! It works. (Whew!)

It would also be possible to define a sub-structure containing temperature, pressure, and any additional interesting data. The fields in the sub-structure could be copied as struct assign, allowing the compiler to do it in any order. But the pre_seq and post_seq must be individually copied in the right order.

DIGRESSIONS!

I do love those exclamation points!

So, the above code works. How *good* is the design? For physical process measurement, probably just fine. It would be rare to measure physical characteristics more often than a few times per second, so the chances of the reader having to re-try the sample are very low. (On the other hand, why not use mutexes if you are going that slow?)

But suppose this algorithm is applied to market data, which might have many thousands of ticks per second? The algorithm shown suffers from a fundamental flaw: starvation. If a writer and reader get pathologically synchronized, there is no bound to the number of times the reader will cycle, trying to get a clean sample.

In practice, it might be fine. But whenever possible, I prefer a solution that works by design, not by circumstance.

I think a better approach might be to use a non-blocking queue, like this one. The writer presumably requires non-zero work to produce a new record. The consumer should read the queue dry in a tight loop before processing the last record read. Now the queue need only be long enough to hold the maximum number of produced records between two reads. A slow reader will simply clean out the queue each time it reads.

Some might question the wisdom of doing ANY kind of programming where the order of field copies in a struct assign is important. In the old days, this program would have been done with a critical section, protected by a mutex or semaphore. That works fine, and doesn't depend on order. But high-performance applications of today need much lower latencies. So lock-free algorithms are important. With lock-free, you need to concern yourself with order; the compiler can change the order of the assem language operations from the source code, and the hardware can change the order of physical memory operations from the assem code. Compile optimizer barriers address the former; hardware memory barriers address the latter.

Finally, some will question my choice of making all the fields of the structure volatile. I admit it is a heavy-handed approach that I used for expediency -- this was a prototype after all. For production code I would use a more-precisely targeted VOL macro (mentioned in an earlier post), often in cooperation with a general compile optimizer barrier, like "asm volatile ("" ::: "memory");". Note that these techniques are NOT used to squeeze every nanosecond out of the code. They are used to insure code correctness.

Monday, May 19, 2014

Fast, lock-free, non-blocking queue

In my copious spare time (insert laugh-track here), I have developed a fairly simple queue that features:

Fixed size (size specified at queue creation time).
Non-blocking (enqueuing to a full queue returns immediately with status; dequeuing from an empty queue returns immediately with status).
Single Producer, Single Consumer (SPSC).
No dynamic memory allocates or frees during enqueue and dequeue operations. Messages are void pointers (null pointers not allowed).
High performance. On my Macbook Air, streaming data through a queue averages 11 ns per message (queue size=32), while ping-pong latency is 69 ns (one-way).
Tested on Mac OSX 10.9 (Mavericks) and Linux 2.6 and 3.5 kernels. At present, I only recommend the 64-bit x86 processor family, due to the fact that I take advantage of its programmer-friendly memory model. In the future I hope to generalize it to be efficient on other processor types.

I'm making it available as public domain (or more-accurately, as CC0), which means you can use it for any purpose without restriction.

You can get the module and its documentation here: https://github.com/fordsfords/q/tree/gh-pages

In addition to the queue itself, I produced semi-literate internals documentation I'm rather proud of my semi-literate documentation system.

Sunday, May 18, 2014

Cache Line Size and Alignment

I've been doing some fiddling around with a high-performance queue, and have noticed something. There is a wide-spread belief that CPU caches have a line size of 64 bytes.

Alas, all the world is not X86.

Modern PowerPC cache line size is 128 bytes. Same with Itanium-2. IBM's system Z CPUs have a line size of 256 bytes.

I saw the results of this personally with our Itanium Linux machine. My queue performed VERY poorly. I increased the padding between critical fields to assume a 128-byte cache line, and magically the performance improved dramatically.

While I'm on the subject, I heard from a smart person that cache lines do not necessarily align themselves on the corresponding memory boundaries. I.e. assuming a 64-byte line size and a cache miss, a memory access to address 0 would load a cache line with 0-63. However, if that access were instead to address 32, it would load a cache line with 32-95.

However, I have not been able to verify this experimentally. I've tried on X86, SparcV9, and Itanium, and they all appear to align the caches to a multiple of the cache line size. I.e. an access to address 32 would load 0-63.

If anybody has clarification here, I would appreciate a note. Thanks.

UPDATE: another colleague of mine has stated categorically that order of access will NOT affect the mapping of cache lines to memory addresses. That matches my experimental evidence.

The Meditative Coder

Thursday, June 12, 2014

Order in Structure Assignment Statements

Monday, May 19, 2014

Fast, lock-free, non-blocking queue

Sunday, May 18, 2014

Cache Line Size and Alignment

About Me

Tags (see here)

Blog Archive