Friday, October 24, 2014

Event logger

A long time ago, I mentioned an event logger that I promised I would create a downloadable package for.  I obviously never did.

I've had a couple of opportunities to recommend it recently, so I decided to accept the fact I may never get it a nice neat package, and I should just describe it.

The idea is that in multi-threading and/or distributed systems, it is often not practical to debug problems by adding debug logging.  Sometimes logging changes the timing of the code too much (either masking the problem you're tracking, or even causing new problems due to slowness).  Another problem with logging debug output is that often it takes millions of loops before a problem reproduces, and nobody wants to crawl through gigabytes of log files.  (I can't tell you how many times we've filled disks up with debug logs trying to catch hard-to-reproduce problems.)

So my event logger is intended to be a VERY low-impact debugging aid.  It is a bit of a pain to use, so it is often more of a last resort, but sometimes it's the only way to track a tricky race condition.

Here's a minimal version of it:

/* globals */
#define EVLOG_SIZE 65536                      /* power of 2 */
volatile unsigned long evlog[EVLOG_SIZE];
volatile unsigned long evlog_num_logs = 0;

/* The ORing constant 0x80000000 is there to give a hint that the entry contains a line number. */
#define EVLOG0 do { evlog[ (evlog_num_logs++) & (EVLOG_SIZE-1) ] = 0x80000000 | __LINE__; } while (0)
/* EVLOG1 should only be used immediately after EVLOG0 (like on the same line) */
#define EVLOG1(val) do { evlog[ (evlog_num_logs++) & (EVLOG_SIZE-1) ] = val; } while (0)

Now inside your code, you can sprinkle calls to EVLOG0 and EVLOG1:

    EVLOG0;
    s = socket(...);
    EVLOG0;  EVLOG1(s);
...

    EVLOG0;
    close(s);
    EVLOG0;  EVLOG1(s);

So, the above will create a nice log of events in the global evlog array.  What do you do with it?  One thing is that by making it evlog and evlog_num_logs globals, you can find them with gdb if your program crashes and dumps core.  Also, if your code contains sanity tests and calls some kind of fatal error function, that function can write the contents of those globals to a debug dump file.  Finally, you can even add a user interface to trigger dumping the globals to a file.

Note that although I use this code for multithreaded debugging all the time, it is technically not thread-safe.  The evlog_num_logs++ is not atomic, for one thing, which means that if two CPUs execute it simultaneously, the variable will only be incremented once.  However, given that the primary goal of this event logger is to minimize impact, it is usually better to live with the risk of a mildly-corrupted event log.

I've used this technique in the past, and there is lots of room for improvement.  For example, expand what you save with each event: thread ID, high-resolution time stamp, source file name (to go with line number), etc.  But be careful - add too much and you deviate from the "minimal impact" goal.

One final thought - many products have a "debug mode" in which lots of debug output is spewed to a debug file.  I've already mentioned some problems with this -- changing timing can mask or introduce problems and too much output -- but there's a much bigger problem: debug mode is never there when you need it.  Customer calls you up and says, "a bad thing happened."  You say, "turn on debug mode."  Customer does, and is not able to reproduce the problem.  With this kind of low-impact event logging, you can leave it on all the time.  Have it auto-dump on crashes, and have a customer-triggerable dump also.  You may not be logging the right events, but at least you'll have *some* evidence to work with ... better than "a bad thing happened."

No comments: