Tuesday, June 10, 2014

Error handling: the enemy of readability?

I've been a programmer since the mid-1970s, call it 40 years.  During that time I've seen structured programming and object-oriented programming each rise into popularity.  (More recently, I've seen increased interest in functional programming, and I've started learning a bit about it myself, but I'm not conversant enough in it to write intelligently about it, so I'll omit it from this discussion.)

After all these years, I find it very interesting that no common approach to internal error handling has become widely accepted, the way structure and object orientation have.

I'm speaking here of internal errors, not user errors.  If the user enters bad input, you owe it to the user to detect the problem and report it in a user-friendly way, allowing the user to correct his mistake.  User error detection and handling code is fully part of the program.

Internal errors are different.  These are found by inserting sanity checks to see if something that "should never happen" might have actually happened.  Very often, failed sanity checks represent a software bug that needs to be tracked down and fixed.


ROBUST CODE vs. READABLE CODE?

Most approaches to internal error handling I've seen are poor.  Maybe it is partly due to a lack of programmer discipline - we all get lazy sometimes and forget to check a return status - but I don't think that is the primary reason.  I believe the main culprit is the need for readable code.

The greatest expense in any significant programming effort is not the original writing of code, it's the subsequent maintenance.  The original programmers eventually leave the project, and less-experienced programmers often take over the less-glamorous job of fixing bugs and adding enhancements.  It is critically important that the code be readable.  External documentation is typically out of date moments after it is written, and even internal documentation (comments) is suspect.  The code itself is the only reality that matters, and the harder it is to understand that code, the harder (and more expensive) it will be to maintain and improve it.

And unfortunately, doing a thorough job of checking and handling internal errors often makes the code harder to read.

C

Which is easier to read and maintain?  This:

example 1:
  foo_t *foo_create()
  {
      foo_t *new_foo = malloc(sizeof(foo_t));

      new_foo->state_filep = fopen("foo.state", "r+");
      new_foo->data_filep = fopen("foo.data", "r+");

      new_foo->reader = reader_create();
      new_foo->writer = writer_create();

      return new_foo;
  }  /* foo_create */

or this:

example 2:
  foo_t *foo_create()
  {
      foo_t *new_foo = malloc(sizeof(foo_t));
      if (new_foo != NULL) {  /* malloc successful? */

          new_foo->state_filep = fopen("foo.state", "r+");
          if (new_foo->state_filep != NULL) {  /* fopen successful? */
              new_foo->data_filep = fopen("foo.data", "r+");
              if (new_foo->data_filep != NULL) {  /* fopen successful? */

                  new_foo->reader = reader_create();
                  if (new_foo->reader != NULL) {  /* reader successful? */
                      new_foo->writer = writer_create();
                      if (new_foo->writer != NULL) {  /* writer successful? */

                          return new_foo;
                      }
                      else {
                          log_error("Could not create writer");
                          reader_delete(new_foo->reader);
                          fclose(new_foo->data_filep);
                          fclose(new_foo->state_filep);
                          free(new_foo);
                          return NULL;
                      }
                  }
                  else {
                      log_error("Could not create reader");
                      fclose(new_foo->data_filep);
                      fclose(new_foo->state_filep);
                      free(new_foo);
                      return NULL;
                  }
              }
              else {
                  log_error("Could not open data file");
                  fclose(new_foo->state_filep);
                  free(new_foo);
                  return NULL;
              }
          }
          else {
              log_error("Could not open state file");
              free(new_foo);
              return NULL;
          }
      }
      else {
          log_error("Could not malloc foo");
          return NULL;
      }
  }  /* foo_create */

Most projects I've worked on use a hybrid of the two approaches, but most lean heavily towards one or the other.  The first is easy to read, but brittle; if anything goes wrong, the problem often doesn't generate any visible symptoms till much later, making it a nightmare to diagnose.  The second detects errors early and does proper reporting and cleanup, but is a nightmare to read and maintain.  (All that duplicated code in the else clauses is sure to get out of sync at some point, resulting in resource leaks or uninitialized accesses, neither of which will be detected unless some error actually happens and the error handling code finally gets exercised.)

Wouldn't it be nice to have both good error detection and readability?

example 3:
  foo_t *foo_create()
  {
      foo_t *new_foo = NULL;
      new_foo = malloc(sizeof(foo_t));  ASSRT(new_foo!=NULL);
      memset(new_foo, 0, sizeof(*new_foo));

      new_foo->state_filep = fopen("foo.state", "r+");  ASSRT(new_foo->state_filep!=NULL);
      new_foo->data_filep = fopen("foo.data", "r+");  ASSRT(new_foo->data_filep!=NULL);

      new_foo->reader = reader_create();  ASSRT(new_foo->reader!=NULL);
      new_foo->writer = writer_create();  ASSRT(new_foo->writer!=NULL);

      return new_foo;

  ASSRT_FAIL:
      if (new_foo != NULL) {
          if (new_foo->writer != NULL) writer_delete(new_foo->writer);
          if (new_foo->reader != NULL) reader_delete(new_foo->reader);
          if (new_foo->data_filep != NULL) fclose(new_foo->data_filep);
          if (new_foo->state_filep != NULL) fclose(new_foo->state_filep);
          memset(new_foo, 0, sizeof(*new_foo));  /* poison the freed structure */
          asm volatile ("" ::: "memory");       /* See note [1] */
          free(new_foo);
      }

      return NULL;
  }  /* foo_create */

Note [1]: See http://blog.geeky-boy.com/2014/06/clangllvm-optimize-o3-understands-free.html

My philosophy is to catch errors as early as possible by including sanity checks, but to do it unobtrusively.  Make the checks easy to ignore if you're looking for the code's basic algorithm, but easy to understand if you're analyzing the error handling.  I think example 3 is approximately as readable as the first example, and as robust as the second.  It is important to keep the error handling code visually small so that it can be appended to the ends of lines.  This:

      foo->state_filep = fopen("foo.state", "r+");  ASSRT(foo->state_filep!=NULL);

is much cleaner than this:

      foo->state_filep = fopen("foo.state", "r+");
      if (foo->state_filep == NULL) {
          blah blah blah
      }

As you've probably guessed, the ASSRT macro does nothing if the expression is true, and does a "goto ASSRT_FAIL;" if the expression is false.  A variation is to use "abort()" instead of the goto.  In many ways this is FAR superior for a troubleshooter; a core file can provide a full stack trace and access to program state.  Calling "abort()" also simplifies the code further, removing the explicit cleanup.  But it also means that the code has given up any hope of recovery.  Differentiating between an unrecoverable error and a recoverable error is often not possible at the point where the error is detected, and it is usually considered good form to report the error and return a bad status.

(FYI - the "memset()" before the "free()" is commented as "poison the freed structure".  The purpose of this is to make sure that any dangling references to the foo object will seg fault the instant they try to de-reference one of the internal pointers.  Again, my philosophy is to find bugs as soon as possible.)
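To make that concrete, here is a sketch of the kind of dangling reference the poisoning is meant to catch.  (The foo_delete() function is my invention for this illustration; I'm assuming it poisons-then-frees the structure the same way the cleanup code above does.)

      foo_t *foo = foo_create();
      /* ... use foo ... */
      foo_delete(foo);  /* assumed: zeroes the structure, then frees it */

      /* Buggy dangling reference.  Without poisoning, state_filep might
         still point at a (possibly recycled) FILE and quietly corrupt it.
         With poisoning, state_filep reads back as NULL, so this crashes
         right here at the point of the bug instead of much later. */
      fclose(foo->state_filep);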


ASSRT()

What is this magical "ASSRT()" macro?

  #define ASSRT(cond_expr) do { \
      if (!(cond_expr)) { \
          char errmsg[256]; \
          snprintf(errmsg, sizeof(errmsg), "%s:%d, failed assert: (%s)", \
              __FILE__, __LINE__, #cond_expr); \
          log_error(errmsg); \
          goto ASSRT_FAIL; \
      }  \
  } while (0)

For those unfamiliar with C macro magic, there are two bits here that might benefit from explanation.  First, the "do ... while (0)".  It's the standard way to avoid tricky bugs when macros are used as part of if statements.  The second is the "#cond_expr" in the snprintf call.  The C preprocessor converts a macro parameter preceded by "#" into a C string literal containing the parameter's text.
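To see the kind of bug the "do ... while (0)" prevents, consider a hypothetical braces-only version of the macro (BAD_ASSRT, "enabled", and "do_something_else()" are made-up names for this illustration):

  /* Hypothetical braces-only macro, WITHOUT the do-while wrapper: */
  #define BAD_ASSRT(cond_expr) { \
      if (!(cond_expr)) { log_error(#cond_expr); goto ASSRT_FAIL; } \
  }

  if (enabled)
      BAD_ASSRT(new_foo != NULL);  /* expands to "{ ... };" */
  else
      do_something_else();  /* syntax error: the stray ';' above terminated
                               the if statement, orphaning this else */

Wrapping the body in "do { ... } while (0)" makes the macro invocation plus its trailing semicolon parse as a single statement, so it behaves correctly anywhere a statement is legal.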

So, the macro expansion of "new_foo = malloc(sizeof(foo_t));  ASSRT(new_foo!=NULL);" is:

      new_foo = malloc(sizeof(foo_t));  do {
          if (!(new_foo!=NULL)) {
              char errmsg[256];
              snprintf(errmsg, sizeof(errmsg), "%s:%d, condition false: '%s'",
                  "foo.c", 50, "new_foo!=NULL");
              log_error(errmsg);
              goto ASSRT_FAIL;
          }
      } while (0);

An important reason for making this a macro instead of a function is so that the __FILE__ and __LINE__ preprocessor built-ins apply to the source line that uses ASSRT.  I've also sometimes enhanced the ASSRT macro, making it specific to a program or even a function by including prints of interesting state information, and I've sometimes included errno.  These improvements help the troubleshooter diagnose the internal problem.
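As a rough sketch of the function-specific idea (ASSRT_FOO and its foo_id parameter are invented for this example, not part of the macro above):

  /* Hypothetical function-specific variant: also logs which foo failed. */
  #define ASSRT_FOO(cond_expr, foo_id) do { \
      if (!(cond_expr)) { \
          char errmsg[256]; \
          snprintf(errmsg, sizeof(errmsg), "%s:%d, failed assert: (%s), foo_id=%d", \
              __FILE__, __LINE__, #cond_expr, (int)(foo_id)); \
          log_error(errmsg); \
          goto ASSRT_FAIL; \
      } \
  } while (0)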

One big drawback to this approach?  The error messages logged can only be loved by a programmer. What would an end user do with this?

  Error logger: foo.c:50, failed assert: (new_foo!=NULL)

This is wonderful for a programmer; I don't know how many times I've seen an error log containing "could not open file", only to find that the string appears many times in the source code.  This message pinpoints the source file and line.  However, it is virtually useless to an end user.

On the other hand, is it really any worse than this?

  Error logger: could not malloc foo

This looks less like geek-speak, but it doesn't actually help an end user any more than the first message does.  In both cases, the user has little choice but to complain to the developer.

My point is that including an error-specific, highly-descriptive message makes the error handling code longer and more intrusive.  One goal of mine is to keep the ASSRT on the same line as the code it's checking, which is hard to do with a descriptive message; as soon as the check overflows onto subsequent lines, it interferes with the logical flow of the code, making it harder to read (see the sketch below).  And since internal errors are usually the result of a software bug, it's not so bad to log errors that only a programmer could love: only programmers need to deal with them, and users only need to accurately report them.  (Which reminds me of a war story.)
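For example, with a hypothetical two-argument ASSRT_MSG(cond_expr, msg) variant that appends a descriptive message to the log line, the call site no longer fits on one line:

      new_foo->state_filep = fopen("foo.state", "r+");
      ASSRT_MSG(new_foo->state_filep != NULL,
          "could not open the foo state file; check that foo.state exists");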

I have an evolved version of this approach, which I plan to clean up and present in the not-too-distant future.


JAVA

Java adds exception handling, which some would say contradicts my claim that error handling isn't standardized.  However, Java exceptions merely standardize the low-level language tool; they don't really address the approach a program takes.  When used poorly (e.g. a try-catch at every call), exceptions reduce to the unreadability of example 2.  When used well, they approximate the good readability of example 3.  The key is to sacrifice some user-friendliness in the error messages in favor of keeping the program structure clear.  That sacrifice is worth it, since internal errors should rarely happen.


C++

I am a bit embarrassed to admit that I'm not very familiar with C++'s exception handling facility.  I've never worked on a project that used it.  I've heard many programmers state their strongly-held opinion that C++ exceptions are to be avoided at all costs, but I don't have enough personal experience to have my own opinion.  One concrete data point: a computer engineering senior at U of I (a respected institution) tells me that their C++ course does not teach exceptions.

But I figure that even if C++ exceptions are not up to the task, my C method should work fine.  I'm not married to any specific language tool, so long as error handling can be added without obscuring the code.


P.S.

I risk being haunted by the ghost of Dijkstra by using a dreaded goto.  Maybe Knuth will protect me (alternate).  :-)


P.P.S.

I have a slightly different version in my wiki.  I change the goto to abort(), and I also print the human-readable form of errno.  While fine for most Unix tools, those changes will not be appropriate for all kinds of programs, so the form above is more general.
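A rough sketch of that variant, reconstructed from the description above (the actual wiki version may differ):

  #include <errno.h>   /* errno */
  #include <stdio.h>   /* snprintf */
  #include <stdlib.h>  /* abort */
  #include <string.h>  /* strerror */

  #define ASSRT(cond_expr) do { \
      if (!(cond_expr)) { \
          char errmsg[256]; \
          snprintf(errmsg, sizeof(errmsg), "%s:%d, failed assert: (%s), errno=%d (%s)", \
              __FILE__, __LINE__, #cond_expr, errno, strerror(errno)); \
          log_error(errmsg); \
          abort();  /* dump core for post-mortem debugging; no recovery */ \
      } \
  } while (0)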
