Friday, January 25, 2013

Debugging Tips

A young pup of an engineer recently wrote to me:
    I was wondering (since I'm taking a coding class and I feel that one
    of my weakest points is debugging), do you have any wisdom that you
    could pass along? I would appreciate it immensely since it will most
    likely save me a considerable amount of time on the MP's we have to
    do. If my first coding class taught me anything, the longest part
    about those assignments is the debugging process (and I was awful at
    debugging my own code).
Ah, debugging - my favorite part of software development.  Hey, why are you laughing?  I'm serious!  I honestly do like debugging best.  Here are a few tips that you might find useful.  Being primarily a C programmer, my examples are in C.  But they have analogs in any language.

Bottoms Up!

Maybe I like debgging for the same reasion I like detective shows.  It took me a long time to get tired of CSI, and I still like to watch the odd episode.  I approach debugging like the CSI guys - I follow the evidence via a technique I call bottom-up debugging.

There are other approaches.  Some people take a top-down approach to debugging.  Look at the code and what it is supposed to do, and try to figure out what is broken.  It's all about, "Hmm ... that looks right ... so does that ... that should work ... gee, the code looks right to me.  Guess I need to look harder."  Yes, I'm making a bit of fun of it, but I have seen it successfully used.  I know a guy who can often just look at a screen of code for a minute and point to the bug.

Top-down debugging almost never works for me.  Bottom-up debugging asks the question: what is the code *actually* doing?  It's all about about collecting evidence and saying, "Whoa, I wonder why it did THAT!"  Followed by collecting more evidence.  Unfortunately, most of that evidence is hidden deep inside the chips.  So the job of the debugging engineer is to expose the evidence of what is going on.

Most people know how to use a source-level debugger to single-step code and examine variables, and this is fine for some bugs.  At every step of the way, you can see exactly why it is doing what it is doing.  But single-stepping runs out of steam in a hurry, especially when you need to see what's happening inside nested loops that might have thousands of iterations, or when the program is part of a system of multiple cooperating processes - you often can't use break points and single-stepping without breaking the larger system.

Kick and Scream

So what do you do if it's not feasible to single-step?  Next in most people's bag of tricks is print statements.  Sprinkle liberally throughout your program, printing the values of interesting variables and structures.  The hope is that you'll catch the program as it starts to misbehave.  But especially with looping programs where it can take a long time for the bug to manifest, the amount of output can be overwhelming.  Plus, with cooperating processes, the slowdown introduced by prints can itself disrupt the behavior of the program, long before your elusive bug is caught.  Or worse yet, the prints might change the timing of the program enough for the bug to disappear entirely.  "Works just fine when I turn on debug output, maybe we should just ship it that way."  :-)  This is known as the observer effect where the act of measuring something changes the thing under measurement.  (It is sometimes called the uncertainty principle, but this is not quite right.  The uncertainty principle leads to the observer effect, but it is not the same thing.)

Instead of printing huge volumes of debug output, I like to define a function named kick_and_scream.  The idea is that you add debug code to your program to test for unexpected conditions, and call kick_and_scream if something is wrong.  You might pass in some interesting variables and structures which kick_and_scream will dutifully print the contents of, giving you some insight into the program's state.  What it does next depends.  Here are some things I've had kick_and_scream do after printing state variables:
  • Prompt and read a line from the user.  This has the effect of essentially halting execution of the program, giving you a chance to fire up the debugger.
  • Dump core - for Unix, you should be able to call "abort()".  I assume Windows has something similar (Dr. Watson?).  The idea is to get a memory dump of the program which can be examined with a debugger.  In C, you can always force an illegal memory access:
        *(char *0) = 0;
    that often does the trick.
  • Infinite loop - I've done this for embedded systems whose operating systems are very basic and don't have any kind of core dumping capability.  Maybe you have an in-circuit emulator or an external debugger that you can use to interrupt the program and examine state.  Having the program stuck in an infinite loop will prevent it from destroying the evidence.
For example, let's say that your program maintains a linked list of nodes, and there's a global counter that tells how many nodes there are in the list.  Let's say that you have a search function which is not finding a node that you *know* is in there:
    cur_node = head_node;
    for (i = 0; i < num_nodes; i++) {
        if ( == query_id) return cur_node;
        cur_node = cur_node.next_node;
    return NULL; /* not found */
The problem is that it returns NULL when it shouldn't.  So, why is the function not finding the node?  Maybe the linked list is corrupted.  Maybe num_nodes is wrong.   Maybe a hundred other things.  Top-down debugging would consist of staring at the code to see if there's a mistake which would cause any of those.  Bottom-up consists of figuring out exactly which problem it really is.  I would add a call to kick_and_scream right before the return NULL; and pass it head_node, cur_node, i, num_nodes, and query_node.  Inside kick_and_scream I would print all those, including all the fields of head_node and cur_node (assuming the pointers are non-null).

Working Back

So, let's say you do the above and you determine that when it loops num_nodes times, the last node it checks has a non-null next_node field.  This shouldn't happen - the last node should have a null next_node.  You've made some excellent progress - now you know the probable proximate cause of not finding the node - num_nodes is out of sync with the list itself.  But the bug isn't actually here.  You want to find out where the list got out of sync with num_nodes in the first place.

I would write a new function: list_check.  Pass it num_nodes and head_node and it does this:
    cur_node = head_node;
    for (i = 0; i < num_nodes; i++) {
        if (cur_node.next_node == NULL) kick_and_scream(...);
        cur_node = cur_node.next_node;
    if (cur_node.next_node != NULL) kick_and_scream(...);
The idea is that you can now sprinkle calls to check_list all over your program and basically "binary search" your way to discovering where things get out of sync.

Event Logger

The philosophy of kick_and_scream is to not print out reams of debug output, but instead to only print interesting information once an obvious problem is discovered.  Sometimes, however, you really do need to see sequences.  You might have some kind of state machine which processes input events, and you suspect that the events are arriving in an invalid order.  (Ever try to close a file before you open it?) There may not be enough information available just printing current state - you need to see that state evolve as events are being processes.

So back to adding normal print statements and crawling through thousands of lines of output.

But what if, as I hinted above, the print statements are too disruptive?  Like if they change the timing of the system too much?

I've had good luck with an event logger.  Call it a poor-man's ultra-low-impact print statement.  This is basically a global array treated like a circular queue.  Instead of printing a like of debug output, you add debug values to the array.  If the array fills, it cycles back to the beginning and overwrites the earliest values.  So low impact that it won't have any effect on timing.

The use of the event logger still depends on detecting the problem inside the code and calling kick_and_scream.  The kick_and_scream function would then print the contents of the array.  It basically shows you the last N events which led up to the failure (where N is the size of the global array).

The disadvantage of this approach is that the output is MUCH harder to interpret than nice descriptive print statements.  So I only use this when print statements disrupt the program too much to be useful.  Which happens to be fairly frequently in the kinds of software I write.

I've implemented this several times, and I want to create a downloadable package for it.  I'll announce it when it's ready.


No comments: