Saturday, December 6, 2014

Bowling for Aardvarks

Why Aardvarks?  Because in the 80s I knew somebody who made up the fictitious game show as a quintessentially absurdist entertainment.  He is the same guy who made up the word "blunjo" simply because it made him laugh to say it.  I enjoyed his company, and I'll be damned if I can remember his name!  Maybe someday he will google "blunjo" or "bowling for aardvarks" and get in touch with me.

But that's not what I wanted to talk about.  I want to meditate a bit about a data representation conundrum: obvious v.s. clever.  I normally avoid cleverness in favor of obviousness.  But sometimes it's the wrong choice.

I recently wrote some scoring code for bowling (10 pin), and it got me thinking about representing data.  To refresh your memory, here's a typical score card (courtesy of google images):


I won't explain all the rules, Wikipedia does that, but Ms. Lane here rolled twice in frame 1, knocking down 9 pins and then 1 for a spare.  In the second frame, she only rolled once, knocking down all 10 for a strike.  Etc.


The Obvious Representation

If we want to represent a game's scoring in software, how might one do it?  An old C programmer like me would immediately think about an array of 10 "frame" structures:
    struct frame_s {
        int roll[2];
        int c_score;    /* cumulative */
    }
    struct game {
        char name[80];
        struct frame_s frame[10];
    }
(Full disclosure: I actually did all this in Clojure, with nested vectors rather than structures, but the issues I talk about apply equally to both languages.  I'll stick with C for this post.)

But wait!  The 10th frame can have up to three rolls.  How to represent that?  Given the structure of the paper scoring card, this seems the most obvious and direct representation:
    struct frame_s {    /* frames 1-9 */
        int roll[2];
        int c_score;    /* cumulative */
    }
    struct frame_10_s { /* frame 10 is special */
        int roll[3];
        int c_score;    /* cumulative */
    }
    struct game_s {
        char name[80];
        struct frame_s frame[9];
        struct frame_10_s frame_10;
    }
Is this the right approach?  Everybody who bowls knows what a score card looks like, and treating frame 10 as special is arguably the most obvious and least astonishing way to represent a score card ...  to a bowler.

But who is going to be looking at those data structures?  Programmers.  And I bet that even programmers who bowl would turn a bit pale at the "struct frame_10_s" approach; it would lead to complex code, with lots of "if (frame_num == 10)" statements and *almost* duplicated code.  Hard to get right, hard to maintain.

So let's instead go with 10 copies of "struct_frame_s" containing "int roll[3]" on the assumption it will simplify the code (it does).


Eliminate Cumulative Score

Does anything else stand out?  What about "int c_score", the cumulative score?  It seemed natural given the structure of a paper score sheet, but it's just duplicate data.  You can derive the cumulative score for any given frame by starting at frame 1 and working your way forward.  Sometimes, old C programmers tend shy away from that approach for efficiency reasons.  "OMG, that makes scoring a game O(n**2)!"

But duplicate data also introduces lots of opportunities for bugs.  I would rather start with the more simple and safe approach, and only optimize if subsequent profiling indicates that it is needed.

Let's eliminate "c_score".  Here's our structure:
    struct frame_s {
        int roll[3];
    }
    struct game {
        char name[80];
        struct frame_s frame[10];
    }


A Very Different Representation

So, are we done?  Let's think about the scoring algorithm.  When you get a strike, you add the next two rolls.  Look at Ms. Lane's game.  In frame 2 she got a strike, so she took the 10 for those 10 pins, added the two roles in frame 3 (7 and 2).  No problem.  But what if she had gotten a strike in frame 3?  I.e. two strikes in a row.  In that case, frame 2's score would have to access both frame 3 *and* frame 4.  For a strike, the number of forward frames to access isn't constant, and requires some messy if-then-else code.

Let's try a different representation entirely:
    struct game_s {
        char name[80];
        int roll[21];
    }
We don't divide the rolls up by frame, we just store the rolls sequentially.  The maximum number of rolls in a game is 21 (no strikes in frames 1-9, and a strike or spare in frame 10).  The minimum number of rolls is 11 (strikes in 1-9, and neither strike nor spare in frame 10).

This makes it much easier to calculate a strike's score; you just look at the next two rolls.  In fact, just to make the algorithm even easier, you could use "int roll[23]" with the extra two rolls initialized to zero.  That way, a 21-roll game that ends with a strike can use the same algorithm, which adds the next two rolls.  It doesn't need special case code for frame 10.

So, is this the right choice?  It makes it easier to do the look-ahead calculations for strikes.  But it also makes it non-trivial to find the frame boundaries when you need to, like during display.  And it moves the data representation *very* far from the application's natural visualization of the data (the paper score card).

I've concluded that this approach is better.  I've noodled around with the code, and it's pretty straight-forward to write simple methods to walk the array and identify the frames.  The overall code was easy to write, easy to test, and I think it is easy to read.  I've achieved software simplicity by deviating from the obvious data representation in favor of a more clever one.  I must admit to being a little uneasy arriving here; "clever" is almost a dirty word in my lexicon.  :-)

1 comment:

Steve Ford said...

FYI - I have a short followup to this post related to TDD and Agile.