Thursday, December 31, 2015

C.H.I.P. - a small, cheap computer

UPDATE: C.H.I.P. is dead

Or rather, the company is dead.  I will keep my CHIP content laying around, but I won't be doing much with it.  Eventually, I'll probably move to Raspberry Pi.


I just received the C.H.I.P. single-board computer whose kickstart I supported.

I suspect that anybody reading this blog knows what a Raspberry Pi is.  CHIP is like that, with its own spin which I found appealing.  It's goals:
  • Low cost.  The base price is currently $9 (although see the $5 Pi-zero).
  • Ease and speed of getting started.  It's literally ready to surf the net right out of the box.
  • Small size.  It's smaller around than a business card (although obviously it is thicker).
Accessories most people will use to try it out:
  • Power supply.  It has a micro USB connector, so you might already have a charger that will work.  But be careful: even though the doc says it draws 300 ma peak, get one that can handle over an amp.  My 700 ma supply craps out.  My 1.5 amp does fine.
  • Audio/video cable.  If you have a video camera, you might already have the right one (although I've heard some are wired differently).  If not, you can get one from the CHIP store.
  • TV with 3 "RCA" composite audio/video jacks.  Most TVs have them.
  • USB keyboard.
  • USB mouse.
  • Powered USB hub to power the keyboard and mouse.  Maybe you already have one.  But note that this hub will *not* power the CHIP.  You'll still need the micro USB supply.
  • Case.  Completely optional, but at $2, I suggest getting it.
I did go that route for my first bootup.  Being packrats, my wife and I had all the above items, so we were able to get the GUI operational in about 15 minutes.  Not being very familiar with Linux GUIs, it took another 15 minutes to be surfing the web.  Yep, it really is a general-purpose computer.  But being composite video, it has very low resolution.  And drawing a couple of watts, it isn't very fast (compared to modern laptops).  And being Linux, it doesn't have much PC-like software out of the box.  But really, who would want to use one of these as a replacement for your laptop?  ... Oh.  Um.  Never mind.  :-)


If you aren't interested in the GUI, you don't need any of the above accessories.  All you need is a laptop and a micro USB cable (I had 3 laying around).  Again, be careful: we had a couple cables that were wired for power only and didn't pass signals.

Here's a minimal "getting started" procedure:
    http://www.geeky-boy.com/w/CHIP_startup.html

Thursday, October 15, 2015

Windows corrupting UDP datagrams


We just discovered that under a somewhat unlikely set of circumstances, Microsoft's Windows 7 (SP 1) will corrupt outgoing UDP datagrams.  I have a simple demonstration program ("rsend") which reliably reproduces the bug.  (I'll be pointing my Microsoft contact at this blog post.)

This bug was discovered by a customer, and we were able to reproduce it locally.  I wish I could take the credit, but my friend and colleague, Dave Zabel, did most of the detective work.  And amazing detective work it was!  But I'll leave that description for another day.  Let's concentrate on the bug.


CIRCUMSTANCES FOR THE BUG

1. UDP protocol.  (Duh!)
2. Multicast sends.  (Does not happen with unicast UDP.)
3. A process on the same machine must be joined to the same multicast group as being sent.
4. Window's IP MTU size set to smaller than the default 1500 (I tested 1300).
5. Sending datagrams large enough to require fragmentation by the reduced MTU, but still small enough *not* to require fragmentation with a 1500-byte MTU.

With that mix, you stand a good chance of the outgoing data having two bytes changed.  It seems to be somewhat dependent on the content of the datagram.  For example, a datagram consisting mostly of zeros doesn't seem to get corrupted.  But it's not that hard to find datagram content that *is* consistently corrupted, so my "rsend" demonstration program has one such datagram hard-coded.

Regarding #3, for convenience the rsend program contains code to join the multicast group, but I've also reproduced it without rsend joining, and instead running "mdump" in a different window.

Finally, be aware that I have not done a bunch of sensitivity testing.  I.e. I haven't tried different datagram sizes, different multicast groups, different MTU settings, jumbo frames, etc.  Nor did I try different versions of Windows (only 7), different NICs, etc.  Sorry, I don't have time to experiment.


BUG DEMONSTRATION

This procedure assumes that you have a Windows machine with its MTU at its default of 1500.  (You change it below.)

1. Build "rsend.c" on Windows with VS 2005.  Here's how I build it (from a Visual Studio command prompt):

  cl -D_MT -MD -DWIN32_LEAN_AND_MEAN -I. /Oi -Forsend.obj -c rsend.c

  link /OUT:rsend.exe ws2_32.lib mswsock.lib /MACHINE:I386 /SUBSYSTEM:console /NODEFAULTLIB:LIBCMT rsend.obj

  mt -manifest rsend.exe.manifest -outputresource:rsend.exe;1

2. Run the command, giving it the ip address of the windows machine's interface that you want the multicast to go out of.  For single-homed hosts, just give it the IP address of the machine.  For example:
    rsend 10.1.2.3
To make the tool easy to use, it hard codes the multicast group 239.196.2.128 and destination port 12000.

3. On a separate machine, do a packet capture for that multicast group.  Note that the packet capture utilities I know of (wireshark, tcpdump) do *not* tell the kernel to actually join the multicast group.  I generally deal with this using the "mdump" tool.  Run it in a separate window.  For example:
    mdump 239.196.2.128 12000

In the packet capture, look at the 1278th and 1279th bytes of the UDP datagram data: they should both be zero.  Here they are, with a few bytes preceding them:
    0x75,0x34,0x34,0xa4,0xc5,0xb4,0x00,0x00
NOTE: at this point, the datagram will fit in a single ethernet frame, so no IP fragmentation happens.

4.  While rsend is running, open a command prompt with administrator privilege (right-click on "command prompt" icon and select "run as administrator") and enter:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent

Like magic, bytes 1278 and 1279 of the outgoing UDP datagrams change their values!  Note that with an MTU of 1300, this UDP datagram now needs to be fragmented.  If using wireshark, you'll need to examine the *second* packet to see the entire UDP datagram and get to byte 1278.  I consistently see 0x62,0x27, but that seems to be dependent on datagram content as well.

5. Undo the MTU change:
    netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent

Magically, the bytes go back to their correct values of 0x00,0x00.

Note: if you comment out the setsockopt of IP_ADD_MEMBERSHIP, the corruption will not happen.  The multicast datagrams will still go out, but they will be undamaged when the MTU is reduced.  The obvious suspect is the internal loopback.


SOLUTION

The only solution I know of is to leave the Windows IP MTU at its default of 1500.


WHY SET MTU 1300???

I don't know why our customer set it on one of his systems.  But he said that he would just set it back to 1500, so it must not have been importat.

If you google "windows set mtu size" you'll find people asking about it.  In many cases, the user is trying to reach a web site which is across a VPN or some other private WAN link which does not have an MTU of 1500.  The way TCP works is that it tries to send segments (a TCP segment is basically an IP datagram) as large as possible while avoiding IP fragmentation.  So a TCP instance sending data might start with a 1500-byte segment size.  If a network hop in-transit cannot handle a segment that large, it has a choice: either fragment it or reject it.  TCP explicitly sets an option to say, "do not fragment," so the network hop drops the segment.  It is supposed to return an ICMP error, which the sender's TCP instance will use to reduce its segment size.  This algorithm is known as TCP's "path MTU discovery".

But many network components either do not generate ICMP errors, or do not forward them.  This is supposedly done in the name of "security" (don't get me started).  This breaks path MTU discovery.  But the segments are still being dropped, so eventually the TCP sender times out and the web site doesn't work.  Apparently this is fairly rare, but it does happen.  Hence the  "set MTU" questions.  If the user reduces IP's MTU setting, it artificially reduces the maximum segment size used by TCP.  Do a bit of experimenting to find the right value, et Voila!  (French for "finally, I can download porn!")

So, how could Microsoft possibly not find this during their extensive testing?  Well first of all, UDP use is rare compared to TCP.  Multicast UDP is even more rare.  Sending UDP multicast datagrams larger than MTU is getting close to unicorn rare.  And doing all that with the IP MTU set to a non-standard value?  Heck, I consider myself to be a pretty rigorous tester, and I would never have tried that.


UDPATE:

Thanks to Mr. Anonymous for asking the question about NIC offloading.  We had considered the question previously (see my response in the comments), but in composing my response, I got to thinking about the offset of the corruption.

It's always in the second packet of the fragmented datagram, and always at 1278.  But that offset is with respect to the start of the UDP payload.  What is the offset with respect to the start of the second packet?  I didn't look at this before since Wireshark's ability to reassemble fragmented datagrams is so handy.  But I went ahead and clicked the "Frame" tab and saw that the corruption happens at offset 40 from the start of the packet.

Guess where the UDP checksum belongs in an UNfragmented datagram!  Yep, offset 40.  Something decided to take the second packet of the fragmented datagram and insert a separate UDP checksum where it *should* go if it were not a fragment.

This still seems like a software bug in Windows.  Sure, maybe the NIC is doing the actual calculation.  Maybe it's not.  But it only happens when IP is configured for a non-standard MTU.  If I have MTU=1500 and I send a fragmented datagram, there is no corruption.


UPDATE 2:

I did some experimenting with datagram size and verified something that I suspected.  When the MTU is set to 1300, the corruption only happens when the datagram size is such that a 1500-byte MTU would *not* fragment but a 1300-byte MTU does.  I.e. there is a size range of 200 bytes (the difference between 1300 and 1500).   This is another reason Microsoft's testers apparently didn't discover this.  Even if they tested fragmentation with non-standard MTUs, would they think to test a size in that specific range?  With the benefit of hindsight, sure, it's "obvious".  But if you're just testing combinations of configurations, you would just pick the "send fragments" combination, which is probably chosen to fragment with MTU 1500.  (FYI: I've updated the original post to refine the conditions of the bug.)

I'm normally not a Microsoft cheerleader, so it feels weird to be defending them on this bug.  :-)


UPDATE 3:

Since we noticed that the corruption always happens at offset 40 in the second packet, I decreased the size of the datagram to only include half of the corrupted pair.  Sure enough, the last byte of the datagram got corrupted.  An the second corrupted byte?  Who knows.  I kind of hoped it would corrupt something in Windows and maybe blue-screen it, but no such luck.  I didn't "see" any misbehavior.

Does that mean there *was* no misbehavior?  NO!  The outgoing datagrams suddenly had bad checksums!  Meaning that the mdump tool stopped receiving them since Linux discards datagrams with bad checksums.  But tcpdump captures the packets *before* UDP discards them, so you can see the bad checksums.

I kept decreasing the size of the datagram till it was 1273 bytes.  That still triggers fragmentation when MTU=1300.  The outgoing datagrams had no visible corruption but had bad checksums.  Reduce one more byte, and the datagram fits in one packet.  Suddenly the checksums are OK.

I tried a few things, like sending packets hard, and varying their sizes, but other than the bad checksums I could not see any obvious Windows misbehavior.

I guess my days as a white-hat hacker are over before they started.  (Did I get the tenses right on that sentence?)

Well, I think I'm done experimenting.  If anybody else reproduces it, please let me know your Windows version.


UPDATE 4:

I heard back from my contact at MS.  He said:

We've looked into this, and see what is happening.  If the customer needs to pursue this rather than using a work around (e.g. not setting the MTU size on the loopback path to a different size than the non-loopback interface, etc.) they will need to open a support ticket.  Thank you for letting me know about this."
Which I suspect translates to, "We'll fix it in a future version, but not urgently.  If you need urgency, pay for support."  :-)


REDDIT:

Finally, Hi Reddit users!  Thanks for pushing the hits on this post to many times the total hit count for the whole rest of the blog.  :-)  I read the comments and saw that my first update had already been noticed by somebody else.

Also, something a lot of Reddit comments have fixated on is my claim that UDP multicast is rare.  I meant that the number of programs (and programmers) that use it is very small compared to all software, not that multicast is hardly ever used.  As pointed out, there are several areas of network infrastructure which are multicast-based, so it gets used all the time.  My point is that the number of programmer-hours spent *writing and testing* multicast-based software is very small compared to the overall networking software field.  And as such, it tends not to be as burned-in as, say, TCP.

Also, in most multicast software that I have learned the guts of, the programmer makes sure that datagram sizes are kept small so as to avoid fragmentation.  This seems to be due to the commonly-held idea that you should *never* let IP fragment, which I think comes from the fact that, at least historically, router performance is hurt if it has to perform fragmentation while a datagram is in transit.  I'm not sure if this is still true for modern routers, but historically fragmentation needed to be handled by the supervisory processor.  For the odd packet every now and then, no problem.  For high-rate data flows, it can kill a router.

That seems to be the basis on which a lot of multicast software avoids fragmentation, preferring instead to split large messages into multiple datagrams.  But this reasoning is often not applicable.  Our software intended primarily to be used within a single data center.  When we send a 2K datagram, no router needs to worry about fragmenting it; the sending host's IP stack splits the datagram into packets before they hit the wire.  The intermediate switches and routers all have 1500 MTU, allowing the packets to traverse unmolested.  The final receiving host(s) reassemble and pass the datagram to user space.  This has a noticeable advantage for high-performance applications since the same amount of user data is passed with fewer system calls (the overhead of switching between user and kernel space is significant).

So while I'm sure our software is not alone in sending fragmented multicast datagrams, I stand behind my claim that sending fragmented multicast is relatively rare.

Wednesday, September 30, 2015

Coding Pet Peeves

Ok, these things are not the end of the world.  During a code inspection, I might not even bring them up.  Or maybe I would - it would depend on my mood.


INT OR BOOLEAN

    if (use_tls != 0) {
        /* Code having to do with TLS */
    }

So, what is "use_tls"?  It's obviously somewhat of a flag - if it's non-zero then we need to do TLS stuff - but is it a count?  If it's a count, then it should have been named "use_tls_cnt", or maybe just "tls_cnt".  If it's being used as a boolean, then it should have tested as a boolean:

    if (use_tls) {
        /* Code having to do with TLS */
    }

And yes, this is actual code that I've worked on.  It was being used as a boolean, and should have been tested as one.  Is this a big deal?  No, like I said, I might not have brought it up in a review.  But I am a believer that the code should communicate information to the reader.  The boolean usage tells the reader more than the numeric version (and is easier on the eyes as well).


MALLOC AND FREE

    state_info = OUR_MALLOC_1(sizeof state_info_t);
    /* Code using state_info */
    free(state_info);

Again, actual code I've worked on.  We have a code macro for malloc which does some useful things.  But the author couldn't think of anything useful to do with free so he didn't make a free macro.  He should have.  There have been at least two times (that I know of) when it would have been tremendously useful.  One time we just wanted to log a message for each malloc and free because we suspected we were double-freeing.  Another time we thought that a freed structure was being accessed, and we wanted free to write a pattern on the memory before freeing.

"No problem," you say, "just create that macro and name it free."  Nope.  There are other mallocs: OUR_MALLOC_2, OUR_MALLOC_3, etc (no they aren't actually named that).  And some of the code doesn't use any of the macros, it just calls malloc directly!  For that second feature (writing a pattern before freeing), you need special code in the malloc part as well as the free part.  These all should have been done consistently so that OUR_MALLOC_1 only works with OUR_FREE_1, etc.  That would have allowed us to do powerful things, but as it is, we couldn't.


UNNATURAL CONDITIONAL ORDER

I bet there is an official name for this:

    if ('q' == cmd) break;

I've always felt that code should be written the way you would explain it to somebody.  Would you say, "If q is the command, then exit the loop."  No.  You would say, "If the command is q, then exit the loop."  I've seen this construct a lot where they put the constant in front of the comparison operator, and it is almost always awkward.

Oh, I know there's a good reason for it: it's a form of defensive programming.  Consider:

    if (cmd = 'q') break;

Oops, forgot an equals.  But the compiler will happily assign 'q' to cmd, test it against zero, and always do the break.  Swap the variable and the constant and you get a compile error.

But come on.  Gcc supports the "-parentheses" option which warns the above.  In fact, gcc supports a *lot* of nice optional warnings, which people should be using all the time.  "-Wall" is your friend!


DECLARE CLOSE TO FIRST USE

Ok, I'm more of a recent convert here.  I've written hundreds of functions like this:

    void my_funct(blah blah blah)
    {
        int something = 0;
        int something_else = 1;
        .... and another 10 or 15 variables

In the REALLY old days, you *had* to declare all local variables immediately after the function's open brace.  But modern C compilers let you declare variables anywhere you want.  Variables should be declared and initialized close to their first use.  This has two advantages: as the reader is reading the code and he encounters a variable, he can see right away what its type and initial value is without having to scroll up and back down again.  But more importantly, when the user sees the variable declared, he knows that the variable is not used anywhere above that declaration.  Whereas if it is declared at the top of the function, he has to examine all the function code above the first use to make sure that it isn't secretly being changed from its initial value.


POINTLESS COMMENTS

    i++;  /* increment i */

Oh, thank goodness somebody taught me what the "++" operator does!  Note to programmer: you don't need to teach me the language.

    fclose(file[i]);  /* close the ith file */

Are you sure?  I thought the "fclose" function *opened* a file.  Note to programmer: you don't have to teach me the standard library functions.  They all have man pages.

    session_cleanup(client[i]);  /* clean up the client session */

OK, session_cleanup() probably doesn't have a man page.  But the comment doesn't contain any more information than the function and variable names!

One might add comments to say what the variable "i" is for and what the files in the "filep[]" array are for.  That at least gives information that a reader might not know.  But even those should not be comments, they should be descriptive variable names.

Most comments I've encountered should simply be removed as unnecessary clutter that makes it hard to read the code.  The best comments don't try to explain *what* the code is doing, they explain *why* it is being done.  Usually that is pretty obvious, so only give the "why" explanations when the answer may not be obvious.

    client_num++;

    /* Usually we would close the log file *after* we clean up the
     * session, in case the cleanup code wants to log an error.  But
     * the session_cleanup() function is legacy and is written to
     * open and close the log file itself on error.  */
    fclose(log_file[client_num]);
    session_cleanup(client[client_num]);

That's better.


PETTY PEEVE

"Really Steve?  Brace indention?  What are you, 12?"

(grumble)  I know, I should have grown past this way back when I realized that emacs v.s. vi just doesn't matter.  And really, I don't much care between the various popular styles.  If writing from scratch, I put the open brace at the end of the "if", but I'm also happy to conform to almost any style.

But I just recently had the pleasure of peeking inside the OpenSSL source code and I saw something I hadn't seen for many years:

    if (blah)
        {
        code_to_handle_blah;
        }

It's called "Whitesmith" and supposedly has advantages.  But those advantages seem more philosophical than practical ("the alignment of the braces with the block that emphasizes the fact that the entire block is conceptually (as well as programmatically) a single compound statement").  Um ... sorry, I prefer actual utility over making a subtle point.  Having the close unindented makes it easier to find the end of the block.   The jargon file claims that, "Surveys have shown the Allman and Whitesmiths styles to be the most common, with about equal mind shares."  Really?  I've been a C programmer for over 20 years with 7 different companies, and none of those places used Whitesmith.  This is only the second time I've even seen it.  Equal mind share?   Here's a survey which has Allman 17 times as popular as Whitesmith.  I think jargon file got it wrong.

And yeah, I know that if I were immersed in it for a few months, I wouldn't even notice it any more.  Just like emacs v.s. vi, it doesn't really matter, so long as a project is internally consistent.  But hey, this is *my* rant, and I don't like it!  So there.

Thursday, September 10, 2015

RS-232 serial ports

Remember RS-232?  It is rare now-a-days to find a need for RS-232, but it can come up.  In my case, a couple of machines in our lab have their system consoles tied to a dedicated serial lines which expect a "dumb-ish" terminal (I actually owned a real-life VT-100 for a while).  Again, it is rare that one needs the system console, but especially when you have hardware issues, sometimes it is the only way to get in.

RS-232 is an interface specification which was basically invented to do one thing: connect a MODEM to either a computer or a terminal.  If you look at the RS-232 spec, you will see very MODEM-ish signals, like carrier detect, ring indicator, etc.

The RS-232 interface has two different sides, one named "DTE" and the other named "DCE".  The DTE side is for a terminal or computer.  The DCE side is for a MODEM.  RS-232 is a standard which, among other things, defines what pins on a connector are for which purpose.  If you look at an RS-232 pinout, you will see names on the pins.  For example, on a DB-25 or a DB-9 connector, pin 2 is labeled "Tx".  It is named from the point of view of the DTE, which is to say that the DTE side of the interface is supposed to output to pin 2 serial data it is transmitting to the DCE.  The DCE side is supposed to input from pin 2 serial data it is receiving from the DTE.  So the names make sense for the DTE (it transmits on Tx and I receive on Rx) and the names are backwards for the DCE (it receives on Tx and transmits on Rx).  DTE devices normally have male connectors (with pins) and DCE devices normally have female connectors, although back in the old days, I did find the odd device which violated this standard.

These days, we rarely use MODEMs.  Most of the time we deal with RS-232, we are trying to connect two DTE devices together, like connecting a terminal to a computer.  You can't just use a normal cable for that since both the terminal and the computer are DTEs and will try to transmit on Tx (pin 2).  So to get two DTEs to talk, you need a "null modem".  This is usually a specially-wired cable with a female connector on both ends.  For example, one side's Tx pin is wired to the other side's Rx pin, and vice versa.  See https://en.wikipedia.org/wiki/Null_modem for details.

Modern laptops often don't have serial lines any more, so you'll probably need a USB serial adapter.  I have a Belkin which connects to a Windows PC and presents as COM3 (required driver).  I use putty, which lets you select "serial" and specify which COM port.  I also have a TRENDnet which connects to my Mac and presents as /dev/tty.usbserial (also required driver).  I use the normal Mac "terminal" application and enter:

    screen /dev/tty.usbserial 9600

Both of my USB devices are male DTE.  To use either one as a serial console, I need to use a female-to-female null modem.

Saturday, July 18, 2015

Rsync, Ssh as Root, and Solaris

In my distant past, I dabbled in a bit of Unix system administration.  (How distantly in the past?  SunOS 4.2.  Yeah, I'm old.)

This year, my department lost its system administrator and the decision was made not to replace him. So I and a co-worker (who has also dabbled) are now responsible for administering a lab with 50+ machines, most of them Linux, Windows, and Solaris, but with a scattering of other Unixes like AIX, HP-UX, and FreeBSD.

To make our lives easier, I took the probably inadvisable step of using SSH shared keys to allow our primary file server to log into all other Unix machines as root without password.  (Yes, I know - this allows an intruder who gains root access to our main file server to have root access to all other Unix systems in the lab.  However, understand that the main file server is the only system that particularly *needs* to be secure; the other systems are primarily for testing.  And I don't allow root access in the opposite direction, from test system to file server.)

Anyway, one thing I use this for is to back up selected system areas, like /etc, using rsync.  I was successfully using rsync to back up /etc on most of our test systems, but it wasn't working on Solaris.  A bit of detective work revealed the following (run from our main file server):

    # ssh SolarisBox let | grep PATH
    PATH=/usr/sbin:/usr/bin

This, in spite of the fact that root's .profile set PATH to include /usr/local/bin, which is where rsync lives on our Solaris boxes.  Turns out that on Solaris, .profile is only executed for interactive shells.  I needed to edit /etc/default/login and add:

    SUPATH=desired path

Et voilà!  (Like my Unicode there?)  Backups achieved.

Don't worry, I don't imagine that I'm any kind of expert at system administration.  I won't be posting much on the subject.  :-)

Wednesday, July 8, 2015

UTF-8, Unicode, and International Character Sets (Oh My!)

I'm a little embarrassed to admit that in my entire 35-year career, I've never really learned anything about international character sets (unicode, etc).  And I still know very little.  But recently I had to learn enough to at least be able to talk somewhat intelligently, so I thought I would speed up somebody else's learnings a bit.

A more in-depth page that I like is: http://www.cprogramming.com/tutorial/unicode.html

Unicode - a standard which, among other things, specifies a unique code point (numeric value) to correspond with a printed glyph (character form).  The standard defines a code point space from 1 to 1,114,111 (1 to 0x10FFFF), but as of June 2015 it only assigns 120,737 of those code points to actual glyphs.

UTF-8 - a standard which is basically a means to encode a stream of numeric values with a range from 1 to 2,147,483,647 (1 to 0x7FFFFFFF), using a variable number of bytes (1-6) to represent each value such that numerically smaller values are represented with fewer bytes.  For example, the number 17 requires a single byte to represent, whereas the number 2,000,000,000 requires 6 bytes.  Notice that the largest Unicode code point only requires 4 bytes to represent.


So the Unicode standard specifies that a particular numeric value corresponds to a specific printed character form, and UTF-8 specifies a way to encode those numeric values.  Although the UTF-8 encoding scheme could theoretically be used to encode numbers for any purpose, it was designed to encode Unicode characters reasonably efficiently.  The efficiency derives from the fact that Unicode biases the most-frequently used characters to smaller numeric values.

UTF-8 and Unicode were designed to be backward compatible with 7-bit ASCII, in that the "printable" ASCII characters have numeric values equal to the first 127 Unicode characters (1-127), and UTF-8 can represent those values with a single byte.  Thus the string "ABC" is represented in ASCII as 3 bytes with values 0x41, 0x42, 0x43, and those same three bytes are the valid UTF-8 encoding of the same string in Unicode.  Thus, an application designed to read UTF-8 input is able to read plain ASCII text.  And an application which only understands 7-bit ASCII can read a UTF-8 file which restricts itself to the first 127 Unicode characters.

Another nice thing about UTF-8 is that the bytes of a multi-byte character cannot be confused with normal ASCII characters; every byte of a multi-byte character has the most-significant bit set.  For example, the tilda-n character "ñ" has a Unicode code point 241, and the UTF-8 encoding of 241 is the two-byte sequence  0xC3, 0xB1.  It is also easy to differentiate the first byte of a multi-byte character from subsequent bytes.  Thus, if you pick a random byte in a UTF-8 buffer, it is easy to detect whether the byte is part of a mutli-byte character, and easy to find that character's first byte, or to move past it to the next character.

One thing to notice about UTF-8 is that it is trivially easy to contrive input which is illegal and will not properly parse.  For example, the bytes 0xC3, 0x41 is an illegal sequence in UTF-8 (0xC3 introduces a 2-byte character, and all bytes in a multi-byte character *must* have the most-significant byte set).


Other Encoding Schemes

There are other Unicode encoding schemes, such as UCS-2 and UTF-16, but their usage is declining.  UCS-2 cannot represent all characters of the current Unicode standard, and UTF-16 suffers from problems of ambiguous endianness (byte ordering).  Neither is backward compatible with ASCII text.

Another common non-Unicode-based encoding scheme is ISO-8859-1.  It's advantage is that all characters are represented in a single byte.  It's disadvantage is that it only covers a small fraction of the worlds languages.  It is backward compatible with ASCII text, but it is *not* compatible with UTF-8.  For example, the byte sequence 0xC3, 0x41 is a perfectly valid ISO-8859-1 sequence ("ÃA") but is illegal in UTF-8.  According to Wikipedia, ISO-8859-1 usage has been declining since 2006 while UTF-8 has been increasing.

There are a bunch of other encoding schemes, most of which are variations on the ISO-8859-1 standard, but they represent a small installed base and are not growing nearly as fast as UTF-8.

Unfortunately, there is not a reliable way to detect the encoding scheme being used simply by examining the input.  The input data either needs to be paired with metadata which identifies the encoding (like a mime header), or the user simply has to know what he has and what his software expects.

Most Unixes have available a program named "iconv" which will convert files of pretty much any encoding scheme to any other.  The user is responsible for telling "iconv" what format the input file has.


Programming with Unicode Data

Java and C# have significant features which allow them to process Unicode strings fairly well, but not perfectly.  The Java "char" type is 16 bits, which at the time Java was being defined, was adequate to hold Unicode.  But Unicode evolved to cover more writing systems and 16 bits is no longer adequate, so Java now supports UTF-16 which encodes a Unicode character in either 1 or 2 of those 16-bit chars.  Not being much of a Java or C# programmer, I can't say much more about them.

In C, Unicode is not handled natively at all.

A programmer needs to decide an encoding scheme for in-memory storage of text.  One disadvantage to using something like UTF-8 is that the number of bytes per character varies, making it difficult to randomly access characters by their offset.  If you want the 600th character, you have to start at the beginning and parse your way forward.  Thus, random access is O(n) time instead of O(constant) time that usually accompanies arrays.

One approach that evolved a while ago was the use of a "wide character", with the type "wchar_t".  This would allow you to declare an array of "wchar_t" and be able to randomly access it in O(constant) time.  In earlier days, it was thought that Unicode could live within the range 1-65535, so the original "wchar_t" was 16 bits.  Some compilers still have "wchar_t" as 16 bits (most notably Microsoft Visual Studio).  Other compilers have "wchar_t" as 32 bits (most notably gcc), which makes them a candidate for use with full unicode.

Most recent advice I've seen tells programmers to avoid the use of "wchar_t" due to its portability problems and instead use a fixed 32-bit type, like "uint32_t", which sadly did not exist in Windows until Visual Studio 2010, so you still need annoying conditional compiles to make your code truly portable.  Also, an advantage of wchar_t over uint32_t is the availability of wide flavors of many standard C string handling functions (standardized in C99).

Other opinionated programmers have advised against the use of wide characters altogether, claiming that constant time lookup is vastly overrated since most text processing software spends most of its time stepping through text one character at a time.  The use of UTF-8 allows easy movement across multi-byte characters just by looking at the upper bits of each byte.  Also, the library libiconv provides an API to do the conversions that the "iconv" command does.

And yet, I can understand the attraction of wide (32-bit) characters.  Imagine I have a large code base which does string manipulation.  The hope is that by changing the type from char to wchar_t (or uint32_t), my for loops and comparisons will "just work".  However, I've seen tons of code which assumes that a character is 1 byte (e.g. it will malloc only the number of characters without multiplying by the sizeof the type), so the chances seem small of any significant code "just working" after changing char to wchar_t or uint32_t.

Finally, note that UTF-8 is compatible with many standard C string functions because null can be safely used to indicate the end of string (some other encoding schemes can have null bytes sprinkled throughout the text).  However, note that the function strchr() is *not* generally UTF-8 compatible since it assumes that every character is a char.  But the function strstr() *is* compatible with UTF-8 (so long as *both* string parameters are encoded with UTF-8).


Bottom Line: No Free (or even cheap) Lunch

Unfortunately, there is no easy path.  If I am ever tasked with writing international software, I suspect I will bite the bullet and choose UTF-8 as my internal format.

Thursday, June 4, 2015

C Pointers: Never Too Old to Learn

I've been a C programmer for some 20 years now, and I learned something new about the language just a few days ago.

A less-experienced C coder sent me a question about pointers.  It was in an area which I have seen proficient C coders stumble on.  In my reply, I went on a few digressions to give a better understanding of C pointers.  At one point, I gave the following example:

    char *x1 = "123\n";
    char x2[] = "456\n";
    printf("%p %p", &x1, &x2);

&x1 is obvious - it evaluates to the address of the x1 variable (not the string it points at).  I was all ready to proudly claim that &x2 is illegal and generates a syntax error.  After all, x2 is an array, and referencing an array without an index evaluates to the address of the start of the array.  What does it mean to take the address of an address?    For that to make sense, the address of the array would need to be stored in a pointer variable.  But there *is* no pointer variable holding the address of the x2 array, that address is just an ephemeral value generally held in a register.  So &x2 should be illegal.

But it complied clean, not even a warning!  I got out my trusty K&R 2nd edition (ANSI C) and it says, "The operand [to the & operator] must be an lvalue referring neither to a bit-field nor to an object declared as register, or must be of function type."  Must be an lvalue, and x2 by itself is NOT an lvalue!  (An lvalue can be assigned to: lvalue = rvalue;)  Let's try the C99 spec: "The operand of the unary & operator shall be either a function designator, the result of a [] or unary * operator, or an lvalue that designates an object that is not a bit-field and is not declared with the register storage-class specifier." Different wording, but still specifies an lvalue.

Finally, google led me to a stackoverflow where somebody asked the same question.  The interesting reply:
In C, when you used the name of an array in an expression (including passing it to a function), unless it is the operand of the address-of (&) operator or the sizeof operator, it decays to a pointer to its first element.
That is, in most contexts array is equivalent to &array[0] in both type and value.
In your example, my_array has type char[100] which decays to a char* when you pass it to printf.
&my_array has type char (*)[100] (pointer to array of 100 char). As it is the operand to &, this is one of the cases that my_array doesn't immediately decay to a pointer to its first element.
The pointer to the array has the same address value as a pointer to the first element of the array as an array object is just a contiguous sequence of its elements, but a pointer to an array has a different type to a pointer to an element of that array. This is important when you do pointer arithmetic on the two types of pointer.
(Thanks to Charles Bailey for that answer back in 2010.)

Sure enough, x2 and &x2 both evaluate to the same pointer value, but x2+1 adds 1 to the address, and &x2+1 adds 5 to the address.

I can't find this mentioned in the formal C language specs I checked, but my experiments with the excellent site http://gcc.godbolt.org suggest that it is commonly implemented.  A less-formal C book also mentions it.


C Pointers V.S. Arrays

So much for my newly-learned bit of C.  Since I'm on the subject, here's a rewrite of the digression that triggered my investigation.

What does the following line do?

    char *x1 = "123\n";

The quoted string causes C to allocate an anonymous 5-byte character array in memory and initializes it (at program load time) with 31 32 33 1a 00 (hex).  I call it "anonymous" because there is no identifier (variable name) which directly refers to that array.  All you have is the address of the array, which is what a quoted string evaluates to in an expression.  That address is assigned into a pointer variable, x1.

So far, so good.  Now:

    char x2[] = "456\n";

This does *NOT* allocate an anonymous character array.  It allocates a 5-character array and assigns the variable name"x2" and pre-initializes it with 34 35 36 1a 00 (hex).  So the x2 variable (an array) is completely different from x1 (a pointer).

However, you can use the two variables in the same ways.  For example, this will do what you expect:

    printf("%s%s", x1, x2);  /* prints two lines, "123" and "456" */

It is easy to think of x1 and x2 as more the same than they really are, but I think it helps to analyze it.  Parameters passed to a function are expressions.  The values actually passed are the results of evaluating those expressions.  The above printf function is passed three expressions: "%s%s", x1, and x2.  Here's how they are handled by C:

  • "%s%s" - as before, the quoted string causes C to allocate an anonymous character array and inits it with the supplied characters.  The quoted string evaluates to the address of that array, and that address is passed as the first parameter.
  • x1 - x1 is a simple variable, so the expression evaluates to the content of the variable.  In this case, that content happens to be the address of the string "123\n".
  • x2 - x2 is *not* a simple variable, it is an array.  Normally, an array should be accessed with an index (e.g. x2[1]).  However, when an unindexed array name appears in an expression, C returns the address of the start of the array.  I.e. x2 is the same as &x2[0].  So x2 passes the address of the string "456\n".  But understand the mechanism is completely different; x1 passes the *content* of the variable x1 while x2 passes the *address* of the variable x2.

Similarly, consider this:

    printf("%c%c\n", x1[1], x2[1]);  /* print "25" */

Here again, the results are similar, but the mechanism is different.

  • x1[1] - x1 is *not* an array, so what does x1[1] mean?  It depends on x1's type.  In this case, it is a character pointer, so C takes the address contained in x1 and adds 1*sizeof(*x1) to it, and fetches the character stored at that address.  Since x1 points to type character, sizeof(*x1) evaluates to 1.  So the expression x1[1] evaluates to the character stored at the address in x1 plus 1, which is the character '2'.
  • x2[1] - this is very simple.  Since x2 is an array, x2[1] simply evaluates to the contents of the second element in the array, which is the character '5'.  Note that C does not have to do any pointer arithmetic, it just accessed the array directly.

In the above two examples, the pointer and the array can be thought of similarly.  This true most of the time, and often leads to people forgetting that they are actually different things.  For example:

    printf("%d %d\n", sizeof(x1), sizeof(x2));

This prints "4 5" on a 32-bit machine and "8 5" on a 64-bit machine.  The first number should be clear: x1 is a pointer, so sizeof(x1) should be the number of bytes in a pointer (4 or 8 depending on address length).  But what about the 5?  This is an interesting deviation from the patterns established previously.  If we followed the previous pattern, sizeof(x2) should deconstruct as x2 evaluating to the address of the start of the x2 array.  Then sizeof(x2) should return the size of the address.  However, sizeof is *not* a normal C function, it's a language built-in operator (the parentheses are optional).  It's operand is *not* interpreted as an expression, it's a type.  So sizeof(x1) is not actually the size of the x1 variable, it is the size of the *type* of x1.  Since the type of x2 is an array of 5 characters, sizeof(x2) evaluates to 5.

One more example of x1 and x2 being treated differently, which is the thing I just learned:

    if (x1 == &x1) printf("x1==&x1\n");
    if (x2 == &x2) printf("x2==&x2\n");

The output of the above two lines is the single line "x2==&x2".  Note that this code also generates a warning on the second line since x2 and &x2 are pointers to different types (pointer to a byte v.s. pointer to an array of 5 bytes).

The moral?  Although the mechanisms are different, an array and a pointer to an array can *usually* be treated the same in normal code.  But not always.  It's best to have a deep understanding of what's going on.

Saturday, February 14, 2015

Rapid Cycle Testing with C++

I discovered the convenience of rapid cycle testing in Clojure with "midje" and liked it.  I then re-created it while learning some Ruby.  Now I'm refreshing my C++ skills and wanted the same thing, and decided to combine it with GoogleTest (gtest) and clang on my Mac.

As usual when combining tools, it took a bit of fiddling to get it all playing nicely.  In hopes of saving somebody else some time, I uploaded a sample project to GitHub.

See https://github.com/fordsfords/autotest-cpp

NOTE: has dependencies on other tools; see README.md for details.

I'll probably change my workflow a lot over time, but that minimal example seems to work pretty well.  Run ./autotest.sh in one window and an editor in another.  Every time you write a source file, the automated test runs.  For example:

$ ./autotest.sh 

Now I write "learn.cpp" out, and the autotest window spits out:

autotest: /Users/sford_itunes/drop_sford/Dropbox/sford/src/c++11/autotest-cpp/learn.cpp

[ 33%] Built target gtest
[ 66%] Built target gtest_main
Scanning dependencies of target runUnitTests
[100%] Building CXX object CMakeFiles/runUnitTests.dir/learn.cpp.o
Linking CXX executable runUnitTests
[100%] Built target runUnitTests
        1.77 real         1.40 user         0.19 sys

Running main() from gtest_main.cc
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from CoutTests
[ RUN      ] CoutTests.Simple
Hello world
msg2
[       OK ] CoutTests.Simple (0 ms)
[----------] 1 test from CoutTests (0 ms total)

[----------] 1 test from VarInits
[ RUN      ] VarInits.Simple
[       OK ] VarInits.Simple (0 ms)
[----------] 1 test from VarInits (0 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (1 ms total)
[  PASSED  ] 2 tests.
        0.00 real         0.00 user         0.00 sys
tst Sun Feb 15 14:28:25 CET 2015

        1.79 real         1.40 user         0.20 sys

It's a little verbose; I'll probably tighten it up a bit.  Note that it takes almost 2 seconds, most of which is in the build step.  That's on a second generation Macbook Air.  I miss the sub-second cycle time of Clojure Midje and Ruby autotest, but oh well.

By the way, if you download the project from GitHub, you will note that there is no CMakeLists.txt file, and you'll wonder how it can be using cmake.  Just follow the readme instructions and run the ./rebuild.sh script.  That script creates the CMakeLists.txt file.  (It seemed like a good idea at the time...)

Friday, February 13, 2015

Go To Not So Harmful After All

As one who lived through the final throes "Go To Statement Considered Harmful" debate (I still feel like I need a shower on the rare occasions that I use goto), I find "An Empirical Study of goto in C Code" interesting.

(Credit to slashdot for the find.)

 

Edit: just stumbled on https://koblents.com/Ches/Links/Month-Mar-2013/20-Using-Goto-in-Linux-Kernel-Code/

I've been skimming some Linux kernel code, and I'm not sure they got it right, but I tend to agree in principle that gotos can make code more readable and maintainable, especially for languages that don't support exceptions.

Wednesday, February 11, 2015

How to improve your digital audio?

Have an extra $2194 and would prefer to give it to greedy liars instead of worthy charities?  Then buy yourself one of these 10-foot Ethernet cables.  According to the description:
Even finest high-purity metals have imperfections and grain structure, which due to the direction of casting and drawing become non-symmetrical structures without the conductor. In ways which are not fully understood, both analog and digital audio performance is significantly affected by these non-linearities.
Oh, I fully understand it.  Gullible people can be talked into anything.

Saturday, December 13, 2014

Rapid Cycle Testing

I'm still experimenting with TDD, and the  jury  is  still  out.  As with so many religious wars, I suspect the best advice is to strive for balance.

But one thing that I'm completely sold on is Rapid Cycle Testing.  (Not to be confused with Short Time Horizon Designing, which I'm *not* sold on.)  To the degree that you *do* write automated tests, they should be executed frequently.  How frequently?  Well...
  • How about once per release cycle, right before code freeze?
  • How about once a week?
  • How about once a day, as part of the "nightly" build?
  • How about several times a day, as part of each developer build?
or
  • How about every time you save a source file from your editor?
Working with Clojure and Midje's autotest has sold me on that last one.  It is only practical for simple tests which execute very quickly, but so far that is what most of my unit tests have turned out to be.


Ruby

Being somebody who enjoys learning new things, I decided to do some Ruby learning.  When I learn a new thing, I like to experiment and take notes.  I have files named "learning_lisp.txt" and more recently "learning_clojure.txt".  These files have small examples of source code with minimal explanation; they aren't intended to be useful for somebody else, but only to trigger my memory.  In particular, they show the results of my experiments.

The other day I saw this post, in which the author describes writing unit tests as he learns Ruby.   This made complete sense to me, and I wish I had seen that post a year ago!  Now that I'm learning Ruby, my notes file is "learning_ruby.rb" ... ".rb" instead of ".txt".  My experiments are entered in there instead of just typed interactively in a REPL.  I've only just started, and I'm not sure if I'll stick with it, but I really like the idea of having my notes file be executable; if nothing else, it catches my typos.

Unfortunately, Ruby doesn't have a Midje-like autotest that I know of, so I wrote my own.  I'm sure this will need to evolve significantly over time, especially as I work on larger projects, but here's my "start small" approach.  Note that this is Unix-centric (I use it on my Mac):

    fswatch  -i "\.rb$" -e ".*" . | xargs -n1 ./test1.sh

where "test1.sh" contains little more than:

    echo "========================================="
    ruby $*
    date

Note, I don't think fswatch comes with Mac out of the box.  I used Homebrew's:

    brew install fswatch

I love this setup!  I have my "learning_ruby.rb" file in my editor and my Ruby book open.  I see a new function, and I enter a couple of experiments in my ".rb" file and save it.  The "fswatch" window immediately runs it and shows me the results.  Talk about instant gratification!

Thursday, December 11, 2014

Wanna know who *really* landed a man on the moon?


A woman named Margaret.

https://medium.com/@3fingeredfox/margaret-hamilton-lead-software-engineer-project-apollo-158754170da8

More info on Wikipedia.  Excerpt:
Hamilton's work prevented an abort of the Apollo 11 moon landing: Three minutes before the Lunar lander reached the Moon's surface, several computer alarms were triggered. The computer was overloaded with incoming data, because the rendezvous radar system (not necessary for landing) updated an involuntary counter in the computer, which stole cycles from the computer. Due to its robust architecture, the computer was able to keep running; the Apollo onboard flight software was developed using an asynchronous executive so that higher priority jobs (important for landing) could interrupt lower priority jobs. 
Smart engineer.

Sunday, December 7, 2014

Short Time Horizon Designing

In my previous blog post, I talked about different data representations for recording a game of 10-pin bowling.  I started with an obvious representation, and moved toward a more clever one.

I want to admit that it took me quite a while to write the bowling code.  Part of the reason is that I am very simply *not* a super fast coder (hence my blog title).  Never have been.
<digression selfserving="1">Fortunately, my lack of blinding coding speed has not been an impediment to my career.  In every job I've had, the actual time spent coding is small compared the the time spent doing all the other things that a programmer does.  So if I'm slower at 15% of my job, and just as fast (or faster) at the other parts, my overall productivity is on par with the coding cowboys.  If you need a code jumper, I'm probably not your guy.  If you need code that will stand the test of time, I probably am.</digression>
Anyway, when I started writing the bowling code, I tried using a TDD approach.  One common practice with TDD is to not try to design an ideal system up front, but to only code the minimum required to satisfy the current test set.  I call this the "short time horizon" approach to design.  I've heard Agilists also argue in favor of a short time horizon -- only design your current iteration's software, and if that design turns out to be inadequate for future stories, refactor as needed.

The idea is that by not over-thinking the task at hand, you can do it ... (insert ominous music here) ... faster.  Get something working now.  Worry about the future when it arrives.

So, when I started writing the bowling code, I only looked ahead to the currently-written failing test.  Being an old C programmer, I started with an array of 10 "frame" structures, each with 3 "rolls".  It wasn't until late in the cycle that I included a test case with two strikes in a row, causing me to realize the difficulty in looking forward 2 rolls in a frame-based structure.  I changed the data representation in a fundamental-enough way that the vast majority of the code had to be modified / re-written.

No problem, right?  TDD and Agile is all about refactoring, right?

The danger is that time pressures will result in less refactoring, and more hacking and patching, leading to brittle software that takes a long time to maintain and enhance.  It's always cheaper to hack in this one little feature *without* the refactoring, so the refactor rarely gets done and you end up with a mess.  Ultimately, customers suffer from buggy software releases due to unmaintainable code.

Mind you, I don't blame "management" for this!  (At least, I don't *only* blame management.)  Any time I've gone to my manager -- be it project, product, or administrative -- and said, "It's time already!  We *need* to refactor this code," they've always made time in the schedule.  Sometimes, we've had to debate the definition of "need", but when I've pushed hard enough, they've always agreed.  It's the programmers themselves who push back just as often, sometimes out of an assumption that management wouldn't agree, other times because they're just sick and tired of the code, and want to move on.  With this bowling project, I was programming for fun, so I didn't mind re-writing it.  But it still would have saved me a lot of time had I spent 15 minutes or so thinking and prototyping before deciding.

In the bad old pure-waterfall days, we tried to get the approach right during one marathon design phase, before line 1 of code was written.  It resulted in low overall productivity, and still didn't eliminate the need to refactor.  I firmly believe that the rapid cycle encouraged by Agile and TDD is a good idea, but yesterday I let the pendulum swing too far the other way.  I should have been more meditative up front.  :-)  And I guess that's my point here: balance.  Don't design to death, but also don't let TDD and Agile be the death of design.

---

Update: Who knew there was a whole movement around slow programming?  (Thanks John!)

Saturday, December 6, 2014

Bowling for Aardvarks

Why Aardvarks?  Because in the 80s I knew somebody who made up the fictitious game show as a quintessentially absurdist entertainment.  He is the same guy who made up the word "blunjo" simply because it made him laugh to say it.  I enjoyed his company, and I'll be damned if I can remember his name!  Maybe someday he will google "blunjo" or "bowling for aardvarks" and get in touch with me.

But that's not what I wanted to talk about.  I want to meditate a bit about a data representation conundrum: obvious v.s. clever.  I normally avoid cleverness in favor of obviousness.  But sometimes it's the wrong choice.

I recently wrote some scoring code for bowling (10 pin), and it got me thinking about representing data.  To refresh your memory, here's a typical score card (courtesy of google images):


I won't explain all the rules, Wikipedia does that, but Ms. Lane here rolled twice in frame 1, knocking down 9 pins and then 1 for a spare.  In the second frame, she only rolled once, knocking down all 10 for a strike.  Etc.


The Obvious Representation

If we want to represent a game's scoring in software, how might one do it?  An old C programmer like me would immediately think about an array of 10 "frame" structures:
    struct frame_s {
        int roll[2];
        int c_score;    /* cumulative */
    }
    struct game {
        char name[80];
        struct frame_s frame[10];
    }
(Full disclosure: I actually did all this in Clojure, with nested vectors rather than structures, but the issues I talk about apply equally to both languages.  I'll stick with C for this post.)

But wait!  The 10th frame can have up to three rolls.  How to represent that?  Given the structure of the paper scoring card, this seems the most obvious and direct representation:
    struct frame_s {    /* frames 1-9 */
        int roll[2];
        int c_score;    /* cumulative */
    }
    struct frame_10_s { /* frame 10 is special */
        int roll[3];
        int c_score;    /* cumulative */
    }
    struct game_s {
        char name[80];
        struct frame_s frame[9];
        struct frame_10_s frame_10;
    }
Is this the right approach?  Everybody who bowls knows what a score card looks like, and treating frame 10 as special is arguably the most obvious and least astonishing way to represent a score card ...  to a bowler.

But who is going to be looking at those data structures?  Programmers.  And I bet that even programmers who bowl would turn a bit pale at the "struct frame_10_s" approach; it would lead to complex code, with lots of "if (frame_num == 10)" statements and *almost* duplicated code.  Hard to get right, hard to maintain.

So let's instead go with 10 copies of "struct_frame_s" containing "int roll[3]" on the assumption it will simplify the code (it does).


Eliminate Cumulative Score

Does anything else stand out?  What about "int c_score", the cumulative score?  It seemed natural given the structure of a paper score sheet, but it's just duplicate data.  You can derive the cumulative score for any given frame by starting at frame 1 and working your way forward.  Sometimes, old C programmers tend shy away from that approach for efficiency reasons.  "OMG, that makes scoring a game O(n**2)!"

But duplicate data also introduces lots of opportunities for bugs.  I would rather start with the more simple and safe approach, and only optimize if subsequent profiling indicates that it is needed.

Let's eliminate "c_score".  Here's our structure:
    struct frame_s {
        int roll[3];
    }
    struct game {
        char name[80];
        struct frame_s frame[10];
    }


A Very Different Representation

So, are we done?  Let's think about the scoring algorithm.  When you get a strike, you add the next two rolls.  Look at Ms. Lane's game.  In frame 2 she got a strike, so she took the 10 for those 10 pins, added the two roles in frame 3 (7 and 2).  No problem.  But what if she had gotten a strike in frame 3?  I.e. two strikes in a row.  In that case, frame 2's score would have to access both frame 3 *and* frame 4.  For a strike, the number of forward frames to access isn't constant, and requires some messy if-then-else code.

Let's try a different representation entirely:
    struct game_s {
        char name[80];
        int roll[21];
    }
We don't divide the rolls up by frame, we just store the rolls sequentially.  The maximum number of rolls in a game is 21 (no strikes in frames 1-9, and a strike or spare in frame 10).  The minimum number of rolls is 11 (strikes in 1-9, and neither strike nor spare in frame 10).

This makes it much easier to calculate a strike's score; you just look at the next two rolls.  In fact, just to make the algorithm even easier, you could use "int roll[23]" with the extra two rolls initialized to zero.  That way, a 21-roll game that ends with a strike can use the same algorithm, which adds the next two rolls.  It doesn't need special case code for frame 10.

So, is this the right choice?  It makes it easier to do the look-ahead calculations for strikes.  But it also makes it non-trivial to find the frame boundaries when you need to, like during display.  And it moves the data representation *very* far from the application's natural visualization of the data (the paper score card).

I've concluded that this approach is better.  I've noodled around with the code, and it's pretty straight-forward to write simple methods to walk the array and identify the frames.  The overall code was easy to write, easy to test, and I think it is easy to read.  I've achieved software simplicity by deviating from the obvious data representation in favor of a more clever one.  I must admit to being a little uneasy arriving here; "clever" is almost a dirty word in my lexicon.  :-)

Friday, October 24, 2014

Event logger

A long time ago, I mentioned an event logger that I promised I would create a downloadable package for.  I obviously never did.

I've had a couple of opportunities to recommend it recently, so I decided to accept the fact I may never get it a nice neat package, and I should just describe it.

The idea is that in multi-threading and/or distributed systems, it is often not practical to debug problems by adding debug logging.  Sometimes logging changes the timing of the code too much (either masking the problem you're tracking, or even causing new problems due to slowness).  Another problem with logging debug output is that often it takes millions of loops before a problem reproduces, and nobody wants to crawl through gigabytes of log files.  (I can't tell you how many times we've filled disks up with debug logs trying to catch hard-to-reproduce problems.)

So my event logger is intended to be a VERY low-impact debugging aid.  It is a bit of a pain to use, so it is often more of a last resort, but sometimes it's the only way to track a tricky race condition.

Here's a minimal version of it:

/* globals */
#define EVLOG_SIZE 65536                      /* power of 2 */
volatile unsigned long evlog[EVLOG_SIZE];
volatile unsigned long evlog_num_logs = 0;

/* The ORing constant 0x80000000 is there to give a hint that the entry contains a line number. */
#define EVLOG0 do { evlog[ (evlog_num_logs++) & (EVLOG_SIZE-1) ] = 0x80000000 | __LINE__; } while (0)
/* EVLOG1 should only be used immediately after EVLOG0 (like on the same line) */
#define EVLOG1(val) do { evlog[ (evlog_num_logs++) & (EVLOG_SIZE-1) ] = val; } while (0)

Now inside your code, you can sprinkle calls to EVLOG0 and EVLOG1:

    EVLOG0;
    s = socket(...);
    EVLOG0;  EVLOG1(s);
...

    EVLOG0;
    close(s);
    EVLOG0;  EVLOG1(s);

So, the above will create a nice log of events in the global evlog array.  What do you do with it?  One thing is that by making it evlog and evlog_num_logs globals, you can find them with gdb if your program crashes and dumps core.  Also, if your code contains sanity tests and calls some kind of fatal error function, that function can write the contents of those globals to a debug dump file.  Finally, you can even add a user interface to trigger dumping the globals to a file.

Note that although I use this code for multithreaded debugging all the time, it is technically not thread-safe.  The evlog_num_logs++ is not atomic, for one thing, which means that if two CPUs execute it simultaneously, the variable will only be incremented once.  However, given that the primary goal of this event logger is to minimize impact, it is usually better to live with the risk of a mildly-corrupted event log.

I've used this technique in the past, and there is lots of room for improvement.  For example, expand what you save with each event: thread ID, high-resolution time stamp, source file name (to go with line number), etc.  But be careful - add too much and you deviate from the "minimal impact" goal.

One final thought - many products have a "debug mode" in which lots of debug output is spewed to a debug file.  I've already mentioned some problems with this -- changing timing can mask or introduce problems and too much output -- but there's a much bigger problem: debug mode is never there when you need it.  Customer calls you up and says, "a bad thing happened."  You say, "turn on debug mode."  Customer does, and is not able to reproduce the problem.  With this kind of low-impact event logging, you can leave it on all the time.  Have it auto-dump on crashes, and have a customer-triggerable dump also.  You may not be logging the right events, but at least you'll have *some* evidence to work with ... better than "a bad thing happened."