Thursday, June 12, 2014

Order in Structure Assignment Statements

I had what I thought was a pretty simple piece of code, transferring a structure from one thread to another.  Without describing it blow-by-blow, I'll present it thus: a writer thread is measuring the temperature and pressure of a steam turbine.  A reader periodically samples the measurement.  There is no need for the reader to get every measurement; it only needs the most recent one at the moment it samples.


DESIGN!

A global data structure:

#include <stdint.h>

struct measurement_s {
    volatile uint64_t post_seq;  /* sequence number of measurement */
    volatile  int64_t temperature;
    volatile  int64_t pressure;
    volatile uint64_t pre_seq;   /* sequence number of measurement */
} live;

The algorithm is for the writer to first increment pre_seq, then update the temperature and pressure, then increment the post_seq.  The reader reads the post_seq first, then temperature and pressure, then pre_seq.  If the reader sees pre_seq != post_seq, it means that a collision has happened, and the reader should re-try.  Here's the reader code:

        struct measurement_s sample;
        do {
            sample = live;
        } while (sample.pre_seq != sample.post_seq);
        /* now have valid sample */
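
For completeness, here's a sketch of the writer side, following the order described above (the function name is just for illustration, not from the real code):

        void publish_measurement(int64_t temperature, int64_t pressure)
        {
            live.pre_seq++;              /* announce that an update is in progress */
            live.temperature = temperature;
            live.pressure = pressure;
            live.post_seq++;             /* announce that the update is complete */
        }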


BROKEN!

Pretty simple.  Easy to understand.  And it didn't work right.  The test code was written to make it obvious when inconsistent values are read, and the code above generated inconsistent reads.  A LOT of inconsistent reads.  How can this be???  I used volatile to prevent the optimizer from messing with the order of reads and writes, and it's an x86-64, which has a very predictable memory model, removing the need for most hardware memory barriers.  The algorithm is correct!  Why is it failing?


DIAGNOSE!

So I did what I've been doing a surprising amount recently: I generated assembly language (this time withOUT optimization turned on).  Here's the structure assignment:

    leaq    -64(%rbp), %rax
    movq    _live+24(%rip), %rcx
    movq    %rcx, 24(%rax)
    movq    _live+16(%rip), %rcx
    movq    %rcx, 16(%rax)
    movq    _live+8(%rip), %rcx
    movq    %rcx, 8(%rax)
    movq    _live(%rip), %rcx
    movq    %rcx, (%rax)

HEY!  It copied the structure in the REVERSE ORDER from what I expected!  So it copies pre_seq first, then the temperature and pressure, and post_seq last!  No wonder it detected collisions; that order is NOT thread-safe!


HAPPILY EVER AFTER!

So I replaced the structure assign with explicit copies of the fields:

        struct measurement_s sample;
        do {
            /* Don't do "sample = live;", order of field copies is not defined. */
            sample.post_seq = live.post_seq;
            sample.temperature = live.temperature;
            sample.pressure = live.pressure;
            sample.pre_seq = live.pre_seq;
        } while (sample.pre_seq != sample.post_seq);

YAY!  It works.  (Whew!)

It would also be possible to define a sub-structure containing temperature, pressure, and any additional interesting data.  The fields in the sub-structure could be copied as struct assign, allowing the compiler to do it in any order.  But the pre_seq and post_seq must be individually copied in the right order.
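
Here's a sketch of that layout (the names payload_s and payload are just for illustration):

struct payload_s {
    volatile int64_t temperature;
    volatile int64_t pressure;
    /* ... any additional interesting data ... */
};

struct measurement_s {
    volatile uint64_t post_seq;  /* sequence number of measurement */
    struct payload_s payload;
    volatile uint64_t pre_seq;   /* sequence number of measurement */
} live;

        struct measurement_s sample;
        do {
            sample.post_seq = live.post_seq;  /* must be copied first */
            sample.payload = live.payload;    /* struct assign; any order is fine inside */
            sample.pre_seq = live.pre_seq;    /* must be copied last */
        } while (sample.pre_seq != sample.post_seq);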


DIGRESSIONS!

I do love those exclamation points!

So, the above code works.  How *good* is the design?  For physical process measurement, probably just fine.  It would be rare to measure physical characteristics more often than a few times per second, so the chances of the reader having to re-try the sample are very low.  (On the other hand, why not use mutexes if you are going that slow?)

But suppose this algorithm is applied to market data, which might have many thousands of ticks per second?  The algorithm shown suffers from a fundamental flaw: starvation.  If a writer and reader get pathologically synchronized, there is no bound to the number of times the reader will cycle, trying to get a clean sample.

In practice, it might be fine.  But whenever possible, I prefer a solution that works by design, not by circumstance.

I think a better approach might be to use a non-blocking queue, like this one.  The writer presumably requires non-zero work to produce a new record.  The consumer should read the queue dry in a tight loop before processing the last record read.  Now the queue need only be long enough to hold the maximum number of produced records between two reads.  A slow reader will simply clean out the queue each time it reads.
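
Here's a sketch of that consumer loop, assuming a hypothetical non-blocking queue API where q_deq() returns NULL when the queue is empty (the function names are illustrative, not from any particular library):

        struct measurement_s *rec;
        struct measurement_s *latest = NULL;
        while ((rec = q_deq(q)) != NULL) {
            if (latest != NULL) {
                release_record(latest);   /* discard superseded records (hypothetical) */
            }
            latest = rec;                 /* remember the most-recent record */
        }
        if (latest != NULL) {
            process_record(latest);       /* hypothetical processing step */
        }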


Some might question the wisdom of doing ANY kind of programming where the order of field copies in a struct assign is important.  In the old days, this program would have been done with a critical section, protected by a mutex or semaphore.  That works fine, and doesn't depend on order.  But high-performance applications of today need much lower latencies, so lock-free algorithms are important.  With lock-free, you need to concern yourself with order: the compiler can change the order of the assembly-language operations from the source code, and the hardware can change the order of physical memory operations from the assembly code.  Compiler optimizer barriers address the former; hardware memory barriers address the latter.
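
For example, with GCC on x86-64, the two kinds of barrier might look like this (a sketch; the right barriers depend on the specific ordering requirements):

        /* Compiler optimizer barrier: keeps the compiler from reordering
         * memory accesses across this point; emits no instruction. */
        asm volatile ("" ::: "memory");

        /* Hardware memory barrier: GCC's __sync_synchronize() built-in emits
         * a full fence and also acts as a compiler barrier. */
        __sync_synchronize();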


Finally, some will question my choice of making all the fields of the structure volatile.  I admit it is a heavy-handed approach that I used for expediency -- this was a prototype, after all.  For production code I would use a more precisely targeted VOL macro (mentioned in an earlier post), often in cooperation with a general compiler optimizer barrier, like "asm volatile ("" ::: "memory");".  Note that these techniques are NOT used to squeeze every nanosecond out of the code.  They are used to ensure code correctness.
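
I won't repeat the earlier post here, but one typical formulation of such a macro is shown below (my actual VOL may differ in detail); it forces a single volatile access to an otherwise non-volatile variable:

        #define VOL(_var, _type) (*(volatile _type *)&(_var))

        /* Example: read post_seq through a volatile lvalue, even if the
         * structure itself is declared without volatile fields. */
        sample.post_seq = VOL(live.post_seq, uint64_t);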
