Showing posts with label safe code. Show all posts
Showing posts with label safe code. Show all posts

Tuesday, September 11, 2018

Safe sscanf() Usage

The scanf() family of functions (man page) are used to parse ascii input into fields and convert those fields into desired data types.  I have very rarely used them, due to the nature of the kind of software I write (system-level, not application), and I have always viewed the function with suspicion.

Actually, scanf() is not safe.  Neither is fscanf().  But this is only because when an input conversion fails, it is not well-defined where the file pointer is left.  Fortunately, this is pretty easy to deal with - just use fgets() to read a line and then use sscanf().

You can use sscanf() safely, but you need to follow some rules.  For one thing, you should include field widths for pretty much all of the fields, not just strings.  More on the rules later.


Input Validity Checking with sscanf()

I am a great believer in doing a good job of checking validity of inputs (see strtol preferred over atoi).  It's not enough to process your input "safely" (i.e. preventing buffer overflows, etc.), you should also detect data errors and report them.  For example, if I want to input the value "21" and I accidentally type "2`", I want the program to complain and re-prompt, not simply take the value "2" and ignore the "`".

One problem with doing a good job of input validation is that it often leads to lots of checking code.  In strtol preferred over atoi, my preferred code consumes 4 lines, and even that is cheating (combining two statements on one line, putting close brace at end of line).  And that's for a single integer!  What if you have a line with multiple pieces of information on it?

Let's say I have input lines with 3 pipe-separated fields:

  id_num|name|age

where "id_num" and "age" are 32-bit unsigned integers and "name" is alphabetic with a few non-alphas.  We'll further assume that a name must not contain a pipe character.

Bottom line: here is my preferred code:

#define NAME_MAX_LEN 60
. . .
unsigned int id;
char name[NAME_MAX_LEN+1];  /* allow for null */
unsigned int age;
int null_ofs = 0;

(void)sscanf(iline,
    "%9u|"  /* id */
    "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]|"  /* name */
    "%3u\n"  /* age */
    "%n",  /* null_ofs */
    &id, name, &age, &null_ofs);

if (null_ofs == 0 || iline[null_ofs] != '\0') {
    printf("PARSE ERROR!\n");
}


Error Checking

My long-time readers (all one of you!) will take issue with me throwing away the sscanf() return value.  However, it turns out that sscanf()'s return value really isn't as useful as it should be.  It tells you how many successful conversions were done, but doesn't tell you if there was extra garbage at the end of the line.  A better way to detect success is to add the "%n" at the end to get the string offset where conversion stopped, and make sure it is at the end of the string.

If any of the conversions fail, the "%n" will not be assigned, leaving it at 0, which is interpreted as an error.  Or, if all conversions succeed, but there is garbage at the end, "iline[null_ofs]" will not be pointing at the string's null, which is also interpreted as an error.


STR Macro

More interesting is the "STR(NAME_MAX_LEN)" construct.  This is a specialized macro I learned about on StackOverflow:

#define STR2(_s) #_s
#define STR(_s) STR2(_s)

Yes, both macros are needed.  So the construct:
  "%" STR(NAME_MAX_LEN) "[-A-Za-z,. ]"
gets compiled to the string:
  "%60[-A-Za-z,. ]"

So why use that macro?  If I had hard-coded the format string as "%60[-A-Za-z,. ]", it would risk getting out of sync with the actual size of "name" if we find somebody with a 65 character name.  (I would like it even more if I could use "sizeof(name)", but macro expansion and "sizeof()" are handled by different compiler phases.)


Newline Specifier

This one a little subtle.  The third conversion specifier is "%3u\n".  This tells sscanf() to match newline at the end of the line.  But guess what!  If I don't include a newline on the string, the sscanf() call still succeeds, including setting null_ofs to the input string's null.  In my opinion, this is a slight imperfection in sscanf(): I said I needed a newline, but sscanf() accepts my input without the newline.  I suspect sscanf() is counting newline as whitespace, which it knows to skip over.

If I *really* want to require the newline, I can do:

if (iline[null_ofs - 1] != '\n') then ERROR...


Integer Field Widths

Finally, I add field widths to the unsigned integer conversions.  Why?  If you leave it out, then it will do a fine job of converting numbers from 0..4294967295.  But try the next number, 4294967296, and what will you get?  Arithmetic overflow, with a result of 0 (although the C standards do not guarantee that value).  And sscanf() will not return any error.  So I restrict the number of digits to prevent overflow.  The disadvantage is that with "%9u" you cannot input the full range of an unsigned integer.  An alternative is to use "%10Lu" and supply a long long variable.  Then range check it in your code.  Or you could just use "%8x" and have the user input hexidecimal.  Heck, you could even input it as a 10-character string and use strtol()!

I guess "%9u" was good enough for my needs.  I just wish you could just tell sscanf() to stop converting on arithmetic overflow!


Not Perfect

I already mentioned arithmetic overflow. Any other problems with the above code?  Well, yeah, I guess.  Even though I specify "id" and "age" to be unsigned integers, I notice that sscanf() will allow negative numbers to be inputted.  Sometimes we engineers consider this a feature, not a bug; we like to use 0xFFFFFFFF as a flag value, and it's easier to type "-1".  But I admit it still bothers me.  If I input an age of -1, we end up with the 4-billiion-year-old man (apologies to Mel Brooks).


Tuesday, September 4, 2018

Safe C?

A Slashdot post led me to some good pages related to C safety.
(Update: actually I did go ahead and order the tee shirt.)


Monday, August 27, 2018

Safer Malloc for C

Here's a fragment of code that I recently wrote.  See anything wrong with it?

#define VOL(_var, _type) (*(volatile _type *)&(_var))
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
binding = (lbmbt_rcv_binding_t *)malloc(sizeof(lbmbt_rcv_t));
memset(binding, 0xa5, sizeof(lbmbt_rcv_t));
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature), int) = SIGNATURE_FREE;
free(binding);

No, the problem isn't in that strange VOL() macro; that is explained here.

The problem is that I am trying to allocate an "lbmbt_rcv_binding_t" structure, but because of cut-and-past programming and being in a hurry, I used the size of a different (and smaller) structure: lbmbt_rcv_t.  Since the "signature" field is the last field in the structure, the assignments to it write past the end of the allocated memory block.  GAH!

But we all knew that C is a dangerous language, with its casting and sizeof just begging to be used wrong.  But could it be at least a *little* safer?


A Safer Malloc

#define VOL(_var, _type) (*(volatile _type *)&(_var))

/* Malloc "n" copies of a type. */
#define SAFE_MALLOC(_var, _type, _n) do {\
  _var = (_type *)malloc((_n) * sizeof(_type));\
  if (_var == NULL) abort();\
} while (0)
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
SAFE_MALLOC(binding, lbmbt_rcv_t, 1);
memset(binding, 0xa5, sizeof(lbmbt_rcv_t));
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature), int) = SIGNATURE_FREE;
free(binding);

The above code still has the bug -- it passed in the wrong type -- but at least it generates a compile warning.  The cast of malloc doesn't match the type of the variable.  Since I insist on getting rid of compiler warnings, that would have flagged the bug to me.


If you don't need portability, you can even use the gcc-specific extension "typeof()" in BOTH macros:

#define VOL(_var) (*(volatile typeof(_var) *)&(_var))

/* Malloc "n" copies of a type. */
#define SAFE_MALLOC(_var, _n) do {\
  _var = (typeof(*_var) *)malloc((_n) * sizeof(typeof(*_var)));\
  if (_var == NULL) abort();\
} while (0)
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
SAFE_MALLOC(binding, 1);
memset(binding, 0xa5, sizeof(lbmbt_rcv_t));
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature) = SIGNATURE_FREE;
free(binding);

Now the malloc bug is gone ... by design!


Safer Malloc and Memset

Finally, notice that the memset is also wrong.  Since I frequently like to init my malloced segments to a known pattern (0xa5 is my habit), I can define a second macro:

#define VOL(_var) (*(volatile typeof(_var) *)&(_var))

/* Malloc "n" copies of a type and set its contents to a pattern. */
#define SAFE_MALLOC_SET(_var, _pat, _n) do {\
  _var = (typeof(*_var) *)malloc((_n) * sizeof(typeof(*_var)));\
  if (_var == NULL) abort();\
  memset(_var, _pat, (_n) * sizeof(typeof(*_var)));\
} while (0)
. . .
lbmbt_rcv_binding_t *binding = NULL;
. . .
SAFE_MALLOC_SET(binding, 0xa5, 1);
binding->signature = SIGNATURE_OK;
. . .
VOL(binding->signature) = SIGNATURE_FREE;
free(binding);

No more bugs.


P.S.

It took me WAY longer than it should have to track this down (an entire weekend).  Why?  The bug manifested as a segmentation fault.  So I figured I could just valgrind it.  But lo and behold, I didn't have Valgrind on my Mac.  "No Problem," sez I.  I sez, "I'll just install it."

$ brew install valgrind
valgrind: This formula either does not compile or function as expected on macOS
versions newer than Sierra due to an upstream incompatibility.

GAH!  "Still no problem," sez I.  I sez, "I'll just debug it on Linux."

No seg fault.  Program worked fine.  Valgrind makes no complaint.  So I spent a the weekend doing a divide-and-conquer trackdown of the bug in a large codebase to find the "Mac-specific" bug.  And finally found the bad malloc.

But wait!  That's not Mac-specific!  What's going on here?  After wasting some more time, I finally printed the sizeof() for each structure type.  On Mac, the sizes are different.  On Linux, the structures are the same size.  Of course valgrind on Linux said it was OK -- the malloc size was correct on Linux!

And now I have Valgrind on my mac  which pinpointed the bug immediately.  (Thanks Güngör Budak!).

Sunday, June 26, 2016

snprintf: bug detector or bug preventer?

Pop quiz time!

When you use snprintf() instead of sprintf(), are you:
   A. Writing code that proactively detects bugs.
   B. Writing code that proactively prevents bugs.

Did you answer "B"?  TRICK QUESTION!  The correct answer is:
  C. Writing code that proactively hides bugs.

Here's a short program that takes a directory name as an argument and prints the first line of the file "tst.c" in that directory:
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv)
{
  char path[20];
  char iline[4];
  snprintf(path, sizeof(path), "%s/tst.c", argv[1]);
  FILE *fp = fopen(path, "r");
  fgets(iline, sizeof(iline), fp);
  fclose(fp);
  printf("iline='%s'\n", iline);
  return 0;
}
Nice and safe, right?  Both snprintf() and fgets() do a great job of not overflowing their buffers.  Let's run it:

$ ./tst .
iline='#inc'

Hmm ... didn't get the full input line.  I guess my iline array was too small.  But hey, at least it didn't seg fault, like it might have if I had just used scanf() or something dangerous like that!  No seg faults for me.

$ ./tst ././././././././.
Segmentation fault: 11

Um ... oh, silly me.  My path array was too small.  fopen() failed, and I didn't check its return status.

So I could, and should, check fopen()'s return status.  But that just gives me a more user-friendly error message.  It doesn't tell my *why* the file name is wrong.  Imagine the snprintf() being in a completely different area of the code.  Yes, you discover there's a bug by checking fopen(), but it's nowhere near where the bug actually is.  Same thing, by the way, with the fgets() not reading the entire line.  Who knows how much more code is going to be executed before the program misbehaves because it didn't get the entire line?

And that is my point.  Most of these "safe" functions work the same way: you pass in the size of your buffer, and the functions guarantee that they won't overrun your buffer, but give you *NO* indication that they truncated. I.e. they don't tell you when your buffer is too small.  It's not until later that something visibly misbehaves, and that wastes time and effort working your way back to the root cause.

Now I'm not suggesting that we throw away snprintf() in favor of sprintf().  I'm suggesting that using snprintf() is only half the job.  How about this:

#include <stdio.h>
#include <string.h>
#include <assert.h>
#define BUF2SMALL(_s) do {\
  assert(strnlen(_s, sizeof(_s)) < sizeof(_s)-1);\
} while (0)

int main(int argc, char **argv)
{
  char path[21];
  char iline[5];
  snprintf(path, sizeof(path), "%s/tst.c", argv[1]); BUF2SMALL(path);
  FILE *fp = fopen(path, "r");  assert(fp != NULL);
  fgets(iline, sizeof(iline), fp); BUF2SMALL(iline);
  fclose(fp);
  printf("iline='%s'\n", iline);
  return 0;
}

Now let's run it:

$ ./tst ./.
Assertion failed: (strnlen(iline, sizeof(iline)) < sizeof(iline)-1), function main, file tst.c, line 15.
Abort trap: 6
$ ./tst ././././././././.
Assertion failed: (strnlen(path, sizeof(path)) < sizeof(path)-1), function main, file tst.c, line 13.
Abort trap: 6

There.  My bugs are reported *much* closer to where they really are.

The essence of the BUF2SMALL() macro is that you should use a buffer which is at least one character larger than the maximum size you think you need.  So if you want an answer string to be able to hold either "yes" or "no", don't make it "char ans[4]", make it at least "char ans[5]".  BUF2SMALL() asserts an error if the string consumes the whole array.

One final warning.  Note that in BUF2SMALL() I use "strnlen()" instead of "strlen()".   I wrote BUF2SMALL() to be a general-purpose error checker after a variety of "safe" functions.  For example, maybe I want to use it after a "strncpy()".  Look at what the man page for "strncpy()" says:
Warning:  If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
If you use "strncpy()" to copy a string, the string might not be null-terminated, and  strlen() has a good chance of segfaulting.  So I used strnlen(), which is only "safe" in that it won't segfault.  But it doesn't tell me that the string isn't null-terminated!  So I still need my macro to tell me that the buffer is too small.  The "safe" functions only make the fuse a little longer on the stick of dynamite in your program.

Saturday, June 25, 2016

Of compiler warnings and asserts in a throw-away society

Many people despair at today's "throw away" society.  If you don't want it, just throw it away.

Programmers know this is not a recent phenomenon; they've been throwing stuff away since the dawn of high-level languages.

Actual line from code I'm doing some work on:
    write(fd, str_gpio, len);

The "write" function returns a value, which the programmer threw away.  And I know why without even asking him.  If you were to challenge him, he would probably say, "I don't need the return value, and as for prudent error checking, this program has been running without a glitch for years."

Ugh.  It's never a *good* idea to throw away return values, but I've been known to do it.  But I really REALLY don't like compiler warnings:
warning: ignoring return value of 'write', declared with attribute warn_unused_result [-Wunused-result]
     write(fd, str_gpio, len);
     ^

Well, I didn't feel like analyzing the code to see how errors *should* be handled, so I just cast "write" to void to get rid of the compile warning:
    (void)write(fd, str_gpio, len);

Hmm ... still same warning.  Apparently over 10 years ago, glibc decided to make a whole lot of functions have an attribute that makes them throw that warning if the return value is ignored, and GCC decided that functions with that attribute will throw the warning *even if cast to void*.  If you like reading flame wars, the Interwebs are chock full of arguments over this.

And you know what?  Even though I'm not sure I agree with that GCC policy, it did cause me to re-visit the code and add some actual error checking.  I figured that if write() returning an error was something that "could never happen", then let's enshrine that fact in the code:
    s = write(fd, str_gpio, len);  assert(s == len);

Hmm ... different warning:
warning: unused variable 's' [-Wunused-variable]
     s = write(fd, str_gpio, len);  assert(s == len);
     ^

Huh?  I'm using it right there!  Back to Google.  Apparently, you can define a preprocessor variable to inhibit the assert code.  Some programmers like to have their asserts enabled during testing, but disabled for production for improved efficiency.  The compiler sees that the condition testing code is conditionally compiled, and decides to play it safe and throw the warning that "s" isn't used, even if the condition code is compiled in.  And yes, this also featured in the same flame wars over void casting.  I wasn't the first person to use exactly this technique to try to get rid of warnings.

*sigh*

So I ended up doing what lots of the flame war participants bemoaned having to do: writing my own assert:
#define ASSRT(cond_expr) do {\
  if (!(cond_expr)) {\
    fprintf(stderr, "ASSRT failed at %s:%d (%s)", __FILE__, __LINE__, #cond_expr);\
    fflush(stderr);\
    abort();\
} } while (0)
...
    s = write(fd, str_gpio, len);  ASSRT(s == len);

Finally, no warnings!  And better code too (not throwing away the return value).  I just don't like creating my own assert. :-(

Thursday, June 12, 2014

Order in Structure Assignment Statements

I had what I thought was a pretty simple piece of code, transferring a structure from one thread to another.  Without describing it blow-by-blow, I'll present it thus: a writer thread is measuring temperature and pressure of a steam turbine.  A reader periodically samples the measurement.  There is no need for the reader to get every measurement, it only needs the most-recent at the point that it samples.


DESIGN!

A global data structure:

struct measurement_s {
    volatile uint64_t post_seq;  /* sequence number of measurement */
    volatile  int64_t temperature;
    volatile  int64_t pressure;
    volatile uint64_t pre_seq;   /* sequence number of measurement */
} live;

The algorithm is for the writer to first increment pre_seq, then update the temperature and pressure, then increment the post_seq.  The reader reads the post_seq first, then temperature and pressure, then pre_seq.  If the reader sees pre_seq != post_seq, it means that a collision has happened, and the reader should re-try.  Here's the reader code:

        struct measurement_s sample;
        do {
            sample = live;
        } while (sample.pre_seq != sample.post_seq);
        /* now have valid sample */


BROKEN!

Pretty simple.  Easy to understand.  And it didn't work right.  The test code was written to make it obvious when inconsistent values are read, and the code above generated inconsistent reads.  A LOT of inconsistent reads.  How can this be???  I used volatile to prevent the optimizer from messing with the order of reads and writes, and it's an x86-64, which has a very predictable memory model, removing the need for most hardware memory barriers.  The algorithm is correct!  Why is it failing?


DIAGNOSE!

So I did what I've been doing a surprising amount recently: I generated assem language (this time withOUT optimization turned on).  Here's the structure assignment:

    leaq    -64(%rbp), %rax
    movq    _live+24(%rip), %rcx
    movq    %rcx, 24(%rax)
    movq    _live+16(%rip), %rcx
    movq    %rcx, 16(%rax)
    movq    _live+8(%rip), %rcx
    movq    %rcx, 8(%rax)
    movq    _live(%rip), %rcx
    movq    %rcx, (%rax)

HEY!  It copied the structure in the REVERSE ORDER than I expected!  So it copies the pre_seq first, then the temperature and pressure, and post_seq last!  No wonder it detected collisions, that order is NOT thread-safe!


HAPPILY EVER AFTER!

So I replaced the structure assign with explicit copies of the fields:

        struct measurement_s sample;
        do {
            /* Don't do "sample = live;", order of field copies is not defined. */
            sample.post_seq = live.post_seq;
            sample.temperature = live.temperature;
            sample.pressure = live.pressure;
            sample.pre_seq = live.pre_seq;
        } while (sample.pre_seq != sample.post_seq);

YAY!  It works.  (Whew!)

It would also be possible to define a sub-structure containing temperature, pressure, and any additional interesting data.  The fields in the sub-structure could be copied as struct assign, allowing the compiler to do it in any order.  But the pre_seq and post_seq must be individually copied in the right order.


DIGRESSIONS!

I do love those exclamation points!

So, the above code works.  How *good* is the design?  For physical process measurement, probably just fine.  It would be rare to measure physical characteristics more often than a few times per second, so the chances of the reader having to re-try the sample are very low.  (On the other hand, why not use mutexes if you are going that slow?)

But suppose this algorithm is applied to market data, which might have many thousands of ticks per second?  The algorithm shown suffers from a fundamental flaw: starvation.  If a writer and reader get pathologically synchronized, there is no bound to the number of times the reader will cycle, trying to get a clean sample.

In practice, it might be fine.  But whenever possible, I prefer a solution that works by design, not by circumstance.

I think a better approach might be to use a non-blocking queue, like this one.  The writer presumably requires non-zero work to produce a new record.  The consumer should read the queue dry in a tight loop before processing the last record read.  Now the queue need only be long enough to hold the maximum number of produced records between two reads.  A slow reader will simply clean out the queue each time it reads.


Some might question the wisdom of doing ANY kind of programming where the order of field copies in a struct assign is important.  In the old days, this program would have been done with a critical section, protected by a mutex or semaphore.  That works fine, and doesn't depend on order.  But high-performance applications of today need much lower latencies.  So lock-free algorithms are important.  With lock-free, you need to concern yourself with order; the compiler can change the order of the assem language operations from the source code, and the hardware can change the order of physical memory operations from the assem code.  Compile optimizer barriers address the former; hardware memory barriers address the latter.


Finally, some will question my choice of making all the fields of the structure volatile.  I admit it is a heavy-handed approach that I used for expediency -- this was a prototype after all.  For production code I would use a more-precisely targeted VOL macro (mentioned in an earlier post), often in cooperation with a general compile optimizer barrier, like "asm volatile ("" ::: "memory");".  Note that these techniques are NOT used to squeeze every nanosecond out of the code.  They are used to insure code correctness.

Tuesday, June 10, 2014

Error handling: the enemy of readability?

I've been a programmer since the mid-1970s, call it 40 years.  During that time I've seen the rise into popularity: structured programming and object-oriented programming.  (More-recently, I've seen increased interest in functional programming, and I've started learning a bit about it myself, but I'm not conversant-enough in it to write intelligently about it.  So I'll omit it from this discussion.)

After all these years, I find it very interesting that I haven't seen a common approach to internal error handling become widely-accepted, the way structure and object-orientation have.

I'm speaking here of internal errors, not user errors.  If the user enters bad input, you owe it to the user to detect the problem and report it in a user-friendly way, allowing the user to correct his mistake.  User error detection and handling code is fully part of the program.

Internal errors are different.  These are found by inserting sanity checks to see if something that "should never happen" might have actually happened.  Very often, failed sanity checks represent a software bug that needs to be tracked down and fixed.


ROBUST CODE vs. READABLE CODE?

Most approaches to internal error handling I've seen are poor.  Maybe it is partly due to a lack of programmer discipline - we all get lazy sometimes and forget to check a return status - but I don't think that is the primary reason.  I believe the main culprit is the need for readable code.

The greatest expense in any significant programming effort is not the original writing of code, it's the subsequent maintenance.  The original programmers eventually leave the project, and less-experienced programmers often take over the less-glamorous job of fixing bugs and adding enhancements.  It is critically important that the code be readable.  External documentation is typically out of date moments after it is written, and even internal documentation (comments) are suspect.  The code itself is the only reality that matters, and the harder it is to understand that code, the harder (and more expensive) it will be to maintain and improve it.

And unfortunately, doing a thorough job of checking and handling internal errors often makes the code harder to read.

C

Which is easier to read and maintain?  This:

example 1:
  foo_t *foo_create()
  {
      foo_t *new_foo = malloc(sizeof(foo_t));

      new_foo->state_filep = fopen("foo.state", "rw");
      new_foo->data_filep = fopen("foo.data", "rw");

      new_foo->reader = reader_create();
      new_foo->writer = writer_create();

      return new_foo;
  }  /* foo_create */

or this:

example 2:
  foo_t *foo_create()
  {
      foo_t *new_foo = malloc(sizeof(foo_t));
      if (new_foo != NULL) {  /* malloc successful? */

          new_foo->state_filep = fopen("foo.state", "rw");
          if (new_foo->state_filep != NULL) {  /* fopen successful? */
              new_foo->data_filep = fopen("foo.data", "rw");
              if (new_foo->data_filep != NULL) {  /* fopen successful? */

                  New_foo->reader = reader_create();
                  if (new_foo->reader != NULL) {  /* reader successful? */
                      new_foo->writer = writer_create();
                      if (new_foo->writer != NULL) {  /* writer successful? */

                          return new_foo;
                      }
                      else {
                          log_error("Could not create writer");
                          reader_delete(new_foo->reader);
                          fclose(new_foo->data_filep);
                          fclose(new_foo->state_filep);
                          free(new_foo);
                          return NULL;
                      }
                  }
                  else {
                      log_error("Could not create reader");
                      fclose(new_foo->data_filep);
                      fclose(new_foo->state_filep);
                      free(new_foo);
                      return NULL;
                  }
              }
              else {
                  log_error("Could not open data file");
                  fclose(new_foo->state_filep);
                  free(new_foo);
                  return NULL;
              }
          }
          else {
              log_error("Could not open state file");
              free(new_foo);
              return NULL;
          }
      }
      else {
          log_error("Could not malloc foo");
          return NULL;
      }
  }  /* foo_create */

Most projects I've worked on use a hybrid of the two approaches, but most lean heavily towards one or the other.  The first is easy to read, but brittle; if anything goes wrong, the problem often doesn't generate any visible symptoms till much later, making it a nightmare to diagnose.  The second detects errors early and does proper reporting and cleanup, but is a nightmare to read and maintain.  (All that duplicated code in the else clauses is sure to get out of sync at some point, resulting in resource leaks or uninitialized accesses, neither of which will be detected unless some error actually happens and the error handling code finally gets exercised.)

Wouldn't it be nice to have both good error detection and readability?

example 3:
  foo_t *foo_create()
  {
      foo_t *new_foo = NULL;
      new_foo = malloc(sizeof(foo_t));  ASSRT(new_foo!=NULL);
      memset(new_foo, 0, sizeof(new_foo));

      new_foo->state_filep = fopen("foo.state", "rw");  ASSRT(new_foo->state_filep!=NULL);
      new_foo->data_filep = fopen("foo.data", "rw");  ASSRT(new_foo->data_filep!=NULL);

      new_foo->reader = reader_create();  ASSRT(new_foo->reader!=NULL);
      new_foo->writer = writer_create();  ASSRT(new_foo->writer!=NULL);

      return new_foo;

  ASSRT_FAIL:
      if (new_foo != NULL) {
          if (new_foo->writer != NULL) reader_delete(new_foo0->writer);
          if (new_foo->reader != NULL) reader_delete(new_foo0->reader);
          if (new_foo->data_filep != NULL) fclose(new_foo0->data_filep);
          if (new_foo->state_filep != NULL) fclose(new_foo0->state_filep);
          memset(new_foo, 0, sizeof(new_foo));  /* poison the freed structure */
          asm volatile ("" ::: "memory");       /* See note [1] */
          free(new_foo);
      }

      return NULL;
  }  /* foo_create */

Note [1]: See http://blog.geeky-boy.com/2014/06/clangllvm-optimize-o3-understands-free.html

My philosophy is to catch errors as early as possible by including sanity checks, but doing it unobtrusively.  Make it easy to ignore if you're looking for the code's basic algorithm, but easy to understand if you are analyzing the error handling.  I think example 3 to be approximately as readable as the first example, and as robust as the second.  It is important to keep the error handling code visually small so that it can be appended to the ends of lines.  This:

      foo->state_filep = fopen("foo.state", "rw");  ASSRT(foo->state_filep!=NULL);

is much cleaner than this:

      foo->state_filep = fopen("foo.state", "rw");
      if (foo->state_filep) {
          blah blah blah
      }

As you've probably guessed, the ASSRT macro does nothing if the expression is true, or does a "goto ASSRT_FAIL;" if the expression is false.  A variation is to use "abort()" instead of goto.  In many ways this is FAR superior for a troubleshooter; a core file can provide a full stack trace and access to program state.  Calling "abort()" also simplifies the code further, removing explicit clean up.  But it also means that the code has given up on any hope of recovery.  Differentiating between an un-recoverable error and a recoverable error is often not possible at the point where the error is detected; it is usually considered good form to report the error and return a bad status.

(FYI - the "memset()" before the "free()" is commented as "poison the freed structure".  The purpose of this is to make sure that any dangling references to the foo object will seg fault the instant they try to de-reference one of the internal pointers.  Again, my philosophy is to find bugs as soon as possible.)


ASSRT()

What is this magical "ASSRT()" macro?

  #define ASSRT(cond_expr) do { \
      if (!(cond_expr)) { \
          char errmsg[256]; \
          snprintf(errmsg, sizeof(errmsg), "%s:%d, failed assert: (%s)", \
              __FILE__, __LINE__, #cond_expr); \
          log_error(errmsg); \
          goto ASSRT_FAIL; \
      }  \
  } while (0)

For those unfamiliar with C macro magic, there are two bits here that might benefit from explanation.  First the "do ... while (0)".  This is explained pretty well here.  It's the standard way to avoid tricky bugs when macros are used as part of if statements.  The second is the "#cond_expr" at the end of the sprintf.  The C preprocessor converts "#macro input parameter" into a C string.

So, the macro expansion of "new_foo = malloc(sizeof(foo_t));  ASSRT(new_foo!=NULL);" is:

      new_foo = malloc(sizeof(foo_t));  do {
          if (!(new_foo!=NULL)) {
              char errmsg[256];
              snprintf(errmsg, sizeof(errmsg), "%s:%d, condition false: '%s'",
                  "foo.c", 50, "new_foo!=NULL");
              log_error(errmsg);
              goto ASSRT_FAIL;
          }
      } while (0);

An important reason for making this a macro instead of a function is so that the __FILE__ and __LINE__ compiler built-ins apply to the source line which uses ASSRT.  I've also sometimes enhanced the ASSRT macro, making it specific to a program or even a function by including prints of interesting state information.  I've also sometimes included errno.  These improvements help the troubleshooter diagnose the internal problem.

One big drawback to this approach?  The error messages logged can only be loved by a programmer. What would an end user do with this?

  Error logger: foo.c:50, failed assert: (new_foo!=NULL)

This is wonderful for a programmer; I don't know how many times I've seen an error log containing, "could not open file", only to find that the string appears many times in the source code.  This pin-points the source file and line.  However, this message is virtually useless to an end user.

On the other hand, is it really any worse than this?

  Error logger: could not malloc foo

This looks less like geek-speak, but doesn't actually help an end user any more.  In both cases, the user has little choice but to complain to the developer.

My point is that including an error-specific, highly-descriptive message makes the error handling code longer and more-intrusive.  One goal of mine is to keep the ASSRT on the same line as the line it's checking; hard to do if including a descriptive message.  But as soon as it overflows onto subsequent lines, it interferes with the logical flow of the code, making it harder to read.  You figure, since internal errors are usually the result of a software bug, it's not so bad to log errors that only a programmer could love - only programmers need to deal with them, users only need to accurately report them.  (Which reminds me of a war story.)

I have an evolved version of this approach, which I plan to clean up and present in the not-too-distant future.


JAVA

Java adds exception handling, which some would say contradicts my claim that error handling isn't standardized.  However, Java exceptions merely standardize the low-level language tool.  It doesn't really address the approach that a program takes.  When used poorly (e.g. try-catch at every call), it reduces to the unreadability of example 2.  When done well, it approximates the good readability of example 3.  The key is the sacrifice some user-friendliness of error message in favor of keeping the program structure clear.  That sacrifice is worth it since internal errors should rarely happen.


C++

I am a bit embarrassed to admit that I'm not very familiar with C++'s exception handling facility.  I've never worked on a project that used it.  I've heard many programmers state their strongly-held opinion that C++ exceptions are to be avoided at all costs, but I don't have enough personal experience to have my own opinion.  One concrete data point: a computer engineering senior at U of I (a respected institution) tells me that their C++ course does not teach exceptions.

But I figure that even if C++ exceptions are not up to the task, my C method should work fine.  I'm not married to any specific language tool, so long as error handling can be added without obscuring the code.


P.S.

I risk being haunted by the ghost of Dijkstra by using a dreaded goto.  Maybe Knuth will protect me (alternate).  :-)


P.P.S.

I have a slightly different version in my wiki.  I change the goto to abort(), and I also print the human-readable form of errno.  While fine for most Unix tools, those changes will not be appropriate for all kinds of programs, so the form above is more general.

Thursday, June 5, 2014

clang/llvm optimize -O3 understands free()


Wow, I just found an amazing bug, although I'm not "bug" is the right word for it.

I have an object delete function:

void e_del(e_t *e)
{
    if (e == NULL || e->signature != E_SIGNATURE) abort();
    e->signature = 0;
    free(e);
}

I have int signature at the end of the e_t structure, set at object creation to a known value (E_SIGNATURE).  Then, I can check periodically to make sure that a valid object is being used.  This detects a variety of bugs, like writing past the end of a buffer in the e_t structure, or just an invalid pointer.  The setting of signature to zero in e_del() is to detect a double call to e_del() with the same object.

In my unit tests, I coded a double-delete to make sure it worked, but abort didn't get called!  Instead, it double-freed!

Silly me, I must have a stupid error in my logic.  Re-build without optimization to make it easier to set breakpoints and single step.  Works.  Works fine.  Without optimization, my code works.  With optimize level 3, it doesn't.  I generated assembly language for optimize and non-optimize cases, and guess what.  At optimize 3, the zero of signature is gone.  Completely gone.

I had a theory, which I tested by changing free(e) to printf("%p", e) and guess what.  At optimize 3, the zero of signature reappeared.  The code worked as expected.  Change printf back to free, and the zero of signature disappeared again.

Apparently, clang/llvm knows what free() is for.  It knows that only a buggy program would try to access a heap segment after it is freed.  So for any non-buggy (and non-multi-threaded) program, it is safe to throw away the "e->signature = 0" line.   My problem is that I DON'T want to assume a bug-free program.  I want to do a good job of detecting bugs.

How to fix?  A variation on Linux's ACCESS_ONCE macro which I named VOL:

#define VOL(_x, _type) (*(volatile _type *)&(_x))
. . .
VOL(e->signature, int) = 0;
free(e);

This "volatilizes" access to e->signature, forcing the optimizer to leave the operation alone.  (FYI - my reason for making my own macro is that ACCESS_ONCE uses the gcc-specific built-in "typeof()" to remove the need to pass in the data type.  But I didn't want my code to be gcc-specific.)

I've done a lot of multi-threaded programming in my day, most-recently experimenting with lock-free, non-blocking designs, so I've had to worry about the optimizer from time to time.  But this is the first time I've ever had a simple, single-threaded, straight-line code work wrong due to the optimizer.  Think about what this means.  The optimizer knows what free() is for, knows that the freed segment shouldn't be accessed after the free, and makes code generation decisions on that basis.  I wonder what else the optimizer thinks it knows...

UPDATE: in its first posting, mistakenly attributed this issue to gcc.  I forgot that my Mac now uses the clang/llvm compiler.  I tried this with true gcc, and it did NOT omit the clear of signature.

UPDATE2: I changed the assignment of zero to signature to a memset of the whole structure:

void e_del(e_t *e)
{
    if (e == NULL || e->signature != E_SIGNATURE) abort();
    memset(e, 0, sizeof(*e));
    free(e);
}

The compiler did NOT optimize out the memset.  I'm guessing that it thought the memset function might have side effects, which is interesting since memset is a compiler builtin.  In fact, if signature is the only field in e_t, the memset compiles to the exact same instructions as the zero assignment.

UPDATE3: Wow, optimizers are surprising.  I found a case where the memset() IS optimized away.  I tried telling the compiler that e points at volatile memory, and it STILL optimized it away.   I'm back to using VOL() on the individual fields.  Again, this is all clang/llvm on Mac.  I've never seen gcc optimize away code based on knowing what free() does.

Wednesday, April 16, 2014

strtol preferred over atoi

We've all done it: parsed command-line parameters and converting numeric strings to integers using "atoi()".  And (hopefully) we've all felt guilty about it, because "atoi()" sucks.

Let's say I have a program, "blunjo", which takes a "-d" option with numeric debug level (0=no debug, 1=a little debug, 2=a lot of debug).  So I might use it like this:

    blunjo -d 1 input_file1 input_file2 ...

Inside the code I probably call "getopt()" and include the code fragment:

    case 'd':
        debug_opt = atoi(optarg);
        break;

So, what happens if the user forgets exactly how to use it and enters this:

    blunjo -d input_file1 input_file2 ...

In this case, "optarg" points at "input_file1", which "atoi()" happily converts to zero, turns off debug, and silently skips processing "input_file1".  Might be nice if it actually told the user that "input_file1" is an invalid integer and printed the usage string.

Enter "strtol()".  It's a little more complicated to use (also more flexible):

    long int strtol(const char *nptr, char **endptr, int base);

The Linux man page contains two interesting bits:

If endptr is not NULL, strtol() stores the address of the first invalid character in *endptr. If there were no digits at all, strtol() stores the original value of nptr in *endptr (and returns 0). In particular, if *nptr is not '\0' but **endptr is '\0' on return, the entire string is valid.

... the calling  program  should  set  errno to 0 before the call, and then determine if an error occurred by checking whether errno has a non-zero value after the call.

So this is better code:

    case 'd':
        char *p = NULL;  errno = 0;
        debug_opt = strtol(optarg, &p, 10);
        if (errno != 0 || p == optarg || p == NULL || *p != '\0') {
            usage("Invalid numeric value for -d option"); }
Note that you must make sure that "optarg" is non-null before calling strtol.  If you use getopt and specify "d:" then getopt will guarantee a non-null "optarg".  But if you are parsing the command-line string yourself, beware of the user entering "-d" with nothing at all following it - the next "argv[]" will be null.  Also note that the "p==NULL" check is technically not necessary; so long as "optarg" is non-null, "p" will never be left at null.  However, given that I'm not responsible for the code that sets "p", it just seems like good practice to include the sanity check before dereferencing it.

Here's a macro to make it all even easier, handle 0x-prefixed hexidecimal, and even prints a programmer-friendly error specifying the file:line of the call to it:

    #define SAFE_ATOL(a,l) do { \
      char *in_a = a; char *temp = NULL; long result; errno = 0; \
      if (*in_a == '0' && *(in_a+1) == 'x') \
        result = strtol(in_a+2, &temp, 16); \
      else \
        result = strtol(in_a, &temp, 10); \
      if (errno!=0 || temp==in_a || temp==NULL || *temp!='\0') { \
        fprintf(stderr, "%s:%d, Error, invalid numeric value for %s: '%s'\n", \
           __FILE__, __LINE__, #l, in_a); \
        exit(1); \
      } \
      l = result; /* "return" value of macro */ \
    } while (0)
Here's a usage of the macro:

    case 'd':

        SAFE_ATOL(optarg, debug_opt);
Note that on errors, it abruptly exits the program.

Finally, there is also "strtoll()" which returns a long long int, and has the same error-checking.  The functions "strtoul()" and "strtoull()" are similar but for unsigned.

EDIT: I'm pleased to discover that the function "inet_pton()" does a good job of error checking a dotted-decimal IP address.  For example, adding garbage to the end of a valid IP address is flagged as an error.

EDIT2: I've enhanced the above macro in a few ways and put it on my github. See: https://github.com/fordsfords/safe_atoi

Wednesday, February 19, 2014

Type "n" for Safety! or Strncpy() Considered Risky

Everybody knows that "strncpy()" is the "safe" version of "strcpy()", right?  No chance of going past the destination buffer, right?

Um ...

Here's a nice code fragment I found in some old code (names changed to protect the innocent):

    char foo[MAX_LINE_LEN] = "";
    ...
        case 'F':
            strncpy(foo, optarg, sizeof(foo));
            foo_str = foo;
            foo_len = strlen(foo);

The "n" in "strncpy()" makes sure that if the "optarg" is longer than "foo", it only copies "MAX_LINE_LEN" characters.  No buffer overrun.  Nice and safe, right?

But look at what the man page for "strncpy()" says:
Warning:  If there is no null byte among the first n bytes of src, the string placed in dest will not be null-terminated.
If optarg has "MAX_LINE_LEN" or more non-null characters, "foo" ends up without a null termination!  What happens to the "strlen(foo)"?  It runs right off the end and sooner or later will almost certainly result in a seg fault.  (I heard from a co-worker that we had to fix a lot of instances of this same bug in our code base.)

---

You could fix this bug by adding one line below the "strncpy()":
            foo[sizeof(foo)-1] = '\0';
But I don't like this either.  It leaves intact a different objection I have with the original code: it silently truncates something that the user might not want truncated.  Good tools should not silently do anything different than what the user requested.

So, how about this:
        case 'F':
            foo[sizeof(foo)-1] = '\0';
            strncpy(foo, optarg, sizeof(foo));
            if (foo[sizeof(foo)-1] != '\0') {
                fprintf(stderr, "Warning message!");
                foo[sizeof(foo)-1] = '\0';
            }
GAH!  I've seen code like this and I don't like it.  That code is non-obvious.

Here's a tip for writing good code.  How would you verbally describe the basic high-level approach to somebody?  Would you say:
First set a null at the end of the buffer.  We do this because we want to find out if the strncpy function truncates the string.  If so, then that null will be overwritten with the last character copied.  So we after copying, we check that last character; if the null was overwritten, then we warn the user that it was truncated, and then we add the final null (thus truncating it even further).
No, you wouldn't.  You would say something more like:
If the string fits, copy it.  If it doesn't, warn the user and copy in as much as fits, and null-terminate it.
How about this:
        case 'F':
            if (strlen(optarg) < sizeof(foo))
                strcpy(foo, optarg);
            else {
                fprintf(stderr, "Warning message!");
                memcpy(foo, optarg, sizeof(foo)-1);
                foo[sizeof(foo)-1] = '\0';
            }
            ...

I like this better; it flows almost exactly as the description does.

And yes, there is an irrational part of me that wants to use "strncpy()" inside the "if" statement, even though it isn't needed.  You're just "supposed to".  But "strcpy()" IS more efficient, and it is possible to go overboard with the whole "belt and braces" thing.  However, I would not object to somebody using "strncpy()" there.

I guess the moral of the story is: just because the function has an "n" in it doesn't mean that people are going to use it safely.